AD-A163 951
UNCLASSIFIED

A CONTEXTUAL POSTPROCESSING EXPERT SYSTEM FOR ENGLISH
SENTENCE READING MACHINES (U) AIR FORCE INST OF TECH
WRIGHT-PATTERSON AFB OH SCHOOL OF ENGI..
D V PACIORKOWSKI DEC 85 AFIT/GE/ENG/85D-31 F/G 9/2


AFIT/GE/ENG/85D-31 


A CONTEXTUAL POSTPROCESSING EXPERT SYSTEM
FOR  ENGLISH  SENTENCE  READING  MACHINES 


THESIS 


Presented  to  the  Faculty  of  the  School  of  Engineering 
of  the  Air  Force  Institute  of  Technology 
Air  University 

In  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of 
Master  of  Science  in  Electrical  Engineering 


DAVID  V.  PACIORKOWSKI,  BSEE 
Captain,  USAF 


December  1985 


Approved  for  public  release;  distribution  unlimited. 


PREFACE 


Under the current state of technology, most optical character
recognizers (OCRs) could be described as machines that read with
difficulty. A few OCRs are capable of accurate character recognition
when the input text is highly constrained for image quality, typeface,
and possibly content. The purpose of this study was to design a
postprocessor that could enhance the performance of OCRs which are used
to read documents of a more realistic variety (e.g., no particular font
or subject presumed, less than optimum image quality).

This postprocessor was developed using knowledge-based programming
techniques and employing multiple knowledge sources concerning the
structure and content of English words and sentences. The uniqueness of
my approach was the emphasis on removing the constraints traditionally
placed on the vocabulary and/or subject domain of the input text by
other postprocessing systems. Although there are numerous other
problems in OCR technology that must also be solved, this research is a
very significant step toward the production of an accurate and
efficient automated reading machine.

I would like to thank my advisor, Dr. Mathew Kabrisky, for the
encouragement and guidance that he provided, and for the autonomy of
approach that he permitted me. Most of all, I would like to express
my sincere appreciation to my wife, Tammy, for her patience, her
understanding, and her support throughout this effort.


David Vincent Paciorkowski
ABSTRACT


Knowledge-based programming techniques are used in an expert system
to reduce uncertainty for optical character recognition by combining
evidence from several diverse knowledge sources cooperating in a
hierarchical configuration. The postprocessor development focused on a
system for generic text input that is not constrained to a fixed
vocabulary or particular subject domain. The key element of the
system's effectiveness is a spell checking algorithm that is not
operationally bounded by an enumerated lexicon or biased by statistical
sampling. The postprocessor system is also designed to interface with
almost any type of OCR front-end methodology.


A CONTEXTUAL POSTPROCESSING EXPERT SYSTEM
FOR  ENGLISH  SENTENCE  READING  MACHINES 


I. Introduction


Background.

The increasing use of automatic data processing and the widespread
use of computers to solve complex problems have made it important to
develop efficient means of entering information into these machines.
At present, almost all textual data to be used by computers is initially
translated into a machine-recognizable form by typing on a keyboard. An
alternative interface to these computers, capable of accepting input by
human speech or by optically scanning printed documents, would ease the
process of initial data entry and increase the overall efficiency of
automated information processing.

One of the interface alternatives currently being researched is the
optical character reader (OCR). The OCR is an image processing device
used to identify letters, other typed characters, and in some instances
hand-written text. The image processing performed by an OCR can be
generalized into three basic steps:

1. Digitization - conversion of the optical image into a discrete
mapping of picture elements (pixels), each quantized on a grey
scale to represent signal intensity.

2. Preprocessing - locating individual characters or character
features; may also include image enhancement such as conversion
of the digitized image into binary grey scale format to
facilitate the separation of text components such as lines,
words, and characters.

3. Identification - picking the best match of the character image
to a known character; this may be accomplished by finding the
best alignment between the image and a character template, or
by conversion of a set of feature measurements of an individual
character into an n-dimensional vector, where n is the number
of individual features describing the composition of the
character, and matching the vector representing each optically
scanned character with the closest prototype vector that
represents a known character.
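
The vector-matching form of the identification step can be sketched as
follows. This is a minimal illustration and not the method of any
particular OCR: the three-element feature vectors, the table PROTOTYPES,
and the function rank_candidates are hypothetical, and Euclidean
distance is assumed as the measure of closeness.

```python
import math

# Hypothetical prototype vectors with n = 3 features per character.
PROTOTYPES = {
    "O": (1.0, 0.0, 0.0),
    "C": (0.8, 0.0, 0.3),
    "L": (0.0, 1.0, 0.0),
}

def rank_candidates(features, prototypes=PROTOTYPES):
    """Return (character, distance) pairs sorted from closest to farthest."""
    scored = [(ch, math.dist(features, proto))
              for ch, proto in prototypes.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Feature vector measured from one scanned character (made-up values).
ranked = rank_candidates((0.9, 0.0, 0.1))
best_char = ranked[0][0]
```

Returning the whole ranked list, and not only the closest prototype,
keeps the alternative identifications available to any later processing
stage.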

There are numerous optical character readers currently being
produced for specialized purposes such as bank check reading, postal zip
code reading, and even some page readers for text entry. But reading
machine technology is still in the early stages of development. Those
machines which are commercially available at present perform under
significant restrictions or with limited accuracy. One of the most
limiting restrictions for current machines is their dependence on
specific letter spacing, specific letter sizes, and specific letter
fonts for the printed material used as input. As long as the letters in
the text to be read are well separated, in known positions, and belong
to a particular type set, an OCR employing basic template matching
techniques can successfully read the material. However, printed
material comes in countless variations of size and font. The spacing
between letters is also often nonuniform or even nonexistent. A
reading machine which could operate accurately without the strict
restrictions that current reading systems place on printing style and
format would have widespread applications and would represent a
significant improvement in reading machine technology.

The growing merger of pattern recognition research and
artificial intelligence techniques has opened new avenues toward the
development of accurate, automated, multifont reading systems.
Character recognition by simple template matching cannot be successful
in overcoming the obstacles of multifont reading. The computational
resources needed to store and to search through all typeset variations
of all characters would make a template system impractical to produce
and use, if not impossible to develop. Instead, the features used to
identify characters must be chosen to form intrinsic definitions of each
character. In addition to new feature extraction approaches that improve
the identification abilities of optical character readers,
knowledge-based programming and mathematical algorithms are being
applied as postprocessors to the image recognition process to improve
the overall performance of text reading systems.

A system considered to be state-of-the-art for multifont reading,
the Kurzweil 4000 Intelligent Scanning System, uses the general shapes
of letter components such as loops, horizontal lines, and concavity to
recognize characters. The use of shape features by this system was
chosen to obtain multifont reading capabilities and the ability to read
proportionally spaced type. The concept behind this design choice is
that each letter maintains a basic form, or one of a few basic forms,
for all sizes and styles of print found in text (1:111).

Pattern recognition research conducted at AFIT toward the
development of a general text reading machine has produced promising
results using two-dimensional discrete Fourier transforms (2D-DFT) in
the identification stage of optical character recognition. This
technique has been successful in identifying isolated letters and
isolated whole words with considerable independence of the print size
and of the font used in the text (2; 3; 4). The 2D-DFT was proposed by
Kabrisky as a mathematical model of the human visual information
processing system, to suggest how optical images are represented and
processed in the brain. Subsequent research has given increased
credibility to the idea that the low frequency spatially filtered 2D-DFT
is a Gestalt representation of an image. Therefore, the low frequency
filtered 2D-DFT of a character image represents the essence of the
character's identity independent of its specific form. A letter "B" has
"B-ness" regardless of whether it was printed in a plain gothic font or
in a highly decorative font with swashes and serifs (2:2; 3:4-5; 4).
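
To make the idea of a low frequency filtered 2D-DFT concrete, the sketch
below computes only the transform coefficients nearest to zero
frequency for a small grey-level image. It is an assumption-laden
illustration and not the AFIT implementation: the function names, the
square cutoff window, and the test image are hypothetical, and a direct
summation is used in place of a fast transform.

```python
import cmath

def dft2_coefficient(image, u, v):
    """Single 2D-DFT coefficient F(u, v) of a rectangular grey-level image."""
    rows, cols = len(image), len(image[0])
    total = 0j
    for y in range(rows):
        for x in range(cols):
            angle = -2.0 * cmath.pi * (u * y / rows + v * x / cols)
            total += image[y][x] * cmath.exp(1j * angle)
    return total

def low_frequency_features(image, cutoff=1):
    """Coefficients with |u| and |v| no larger than cutoff, in row-major order."""
    return [dft2_coefficient(image, u, v)
            for u in range(-cutoff, cutoff + 1)
            for v in range(-cutoff, cutoff + 1)]

# A uniform 4 x 4 image: all of its energy lies in the zero-frequency term.
flat_image = [[1.0] * 4 for _ in range(4)]
features = low_frequency_features(flat_image)
```

With cutoff=1 the feature list holds nine complex values, so fine
detail such as serifs, which lives in higher spatial frequencies, does
not enter the representation.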

The intrinsic letter shapes used by the Kurzweil 4000 and the use
of the low frequency 2D-DFT as a Gestalt representation are two examples
of how choosing the proper descriptive features can increase reading
machine capabilities toward efficient automated multifont text
processing. However, in the application of any character identification
technique to printed pages of text, several problems are anticipated to
degrade recognition performance. Lack of control over the quality of
the print is one aspect of the reading environment which will create
uncertainty in the letter identification process. Not all letters will
be well formed and of uniform contrast. Spaces in the ink, or
variations from too much or too little ink for any particular letter,
could cause a misinterpretation by the reading machine. The lack of
separation of letters within a printed word is another factor which
reduces the reliability of the automatic reading process. Parts of
adjoining letters could be included in the optical processing of each
letter. This can often cause a letter to resemble other letters. One
other important source of uncertainty for the reading process is noise.
The noise introduced by the image detection and signal digitization
equipment will also increase the likelihood of character recognition
errors. The OCR using a multifont character identification process is
not in itself a sufficient solution to the automated text reading
problem. Some enhancement is necessary to improve its reliability and
increase its speed of independent operation.

There is a great deal of information contained in the structure and
content of English text which is not being used by OCR systems to
improve the probability of correct character recognition. English
sentences conform to a large but well-defined set of structural
arrangements based upon the part of speech which each word fulfills.
There is also much information which can be derived from the meaning of
each sentence that could be used to resolve uncertainties about
individual letters or words contained in a sentence. The sequences of
letters which form words in any language are constrained by a limited
inventory of allowable sound sequences for that language.


Consider the example of a person trying to read a document which
was copied on a poorly maintained photocopier. Anyone who has had such
an opportunity has found that the text can be read successfully even
when the print is very distorted and significant portions of many of the
letters are missing. Under such circumstances the human reading
mechanism uses a complex and unknown process to complete the task. The
natural redundancy of information in the English language must help
significantly. But the reader is also using his experience and
understanding of the language to solve the problem. The reader must
generate hypotheses about each of the uncertain letters and words.
Eventually, one of the hypothesized sentences which is composed of
recognizable words, is grammatically complete, and is consistent in
content with the other sentences read is chosen as the
solution. A human reader attempting to read an unknown foreign
language under these same circumstances would not be as successful as a
reader in his native language. The performance difference comes from
the understanding of the word structures, the word meanings, and the
entire concept presented by the sentences.

Problem  Statement. 

One logical approach to improving the performance of OCRs is to
imitate the human reading process. An OCR attempting to read letters
which appear incomplete or distorted to its recognition process is
similar to the human reader attempting to read text from a poor quality
copy. Knowledge of the language and information contained in readable
portions of the document must be used together to solve the problem.
Expert systems are computer programs which adapt knowledge sources into
rule-based processes for the purpose of solving difficult problems. It
should be possible to improve the performance and reliability of an OCR
by combining it with an expert system for contextual postprocessing.
The expert system would make use of a priori knowledge about legitimate
letter sequences in words and would process the syntactic and semantic
information available in English sentences. This thesis research will
explore the use of semantic, syntactic, and other word and sentence
knowledge sources in an expert system that will combine the evidence
available from multiple sources to reliably identify characters in
printed English text.


Approach  and  Scope. 

The focus of this research is to devise a contextual postprocessor
that may be used to improve the accuracy and throughput of text reading
systems in general. The various levels of expert systems to be used by
the postprocessor will be chosen so that they do not restrict the image
processing methodology of the OCR front-end, and so that the textual
data entry process can be performed with efficient automation.
Automation efficiency will be provided through increased recognition
accuracy and a reduced need for operator intervention for machine
training and correction.


The contextual postprocessing system developed for this thesis will
consist of a hierarchical control structure which will interface with an
optical character reader and several component expert systems. The
design of the system will be such that almost any type of OCR could be
used with this system, including a template matching OCR. The only
requirement of the OCR is that its character identification methodology
allow it to produce a weighted list of choices for each character being
identified. For testing purposes, the OCR front-end to this system will
use the 2D-DFT method of character identification. The input from the
OCR will be offline or simulated.

The expert systems used by the postprocessor will be arranged to
provide additional evidence concerning the likelihood of each character
choice as characters are combined to form various structures from letter
sequences (n-grams) through sentences. The evidence provided by the
various knowledge sources will be used to either eliminate character,
word, and sentence hypotheses or to endorse the plausibility of those
hypotheses. The association of "probability" measures with competing
letter, word, and sentence hypotheses will be used by the control
structure to combine the evidence provided by the sources of information
and eventually establish a best choice decision at the sentence level.
Some information will be extracted from the sentences that have
completed processing, and this information will be used toward the
processing of the sentences which follow within the same body of text.
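
The way competing letter choices expand into competing word hypotheses
can be illustrated with a short sketch. It assumes, purely for
illustration, that the OCR supplies a list of (letter, weight) pairs for
each character position and that a word's initial score is the product
of its letter weights; neither the function word_hypotheses nor this
scoring rule is taken from the thesis.

```python
from itertools import product

def word_hypotheses(choices_per_position):
    """Enumerate every letter combination with the product of its weights."""
    hypotheses = []
    for combo in product(*choices_per_position):
        word = "".join(letter for letter, _ in combo)
        score = 1.0
        for _, weight in combo:
            score *= weight
        hypotheses.append((word, score))
    # Highest-scoring hypothesis first.
    return sorted(hypotheses, key=lambda pair: pair[1], reverse=True)

# Made-up OCR output for a three-letter word, two choices per position.
ocr_output = [
    [("c", 0.6), ("e", 0.4)],
    [("a", 0.9), ("o", 0.1)],
    [("t", 0.7), ("l", 0.3)],
]
ranked_words = word_hypotheses(ocr_output)
```

Two choices at each of three positions already yield eight hypotheses,
which shows why knowledge sources that eliminate hypotheses early are
valuable.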

The key expert system in the processing hierarchy will be a low
level but powerful knowledge source constructed to provide information
about spelling rules for valid syllables and letter sequences in the
English language. This knowledge source will be used to reduce the set
of solution hypotheses to be processed through the other knowledge
sources by eliminating impossible words. Another expert system to be
used by the postprocessor will be a syntactic parser. This knowledge
source will apply rules of grammar to eliminate non-conforming
sentences. Research for sentence parsing is well established; therefore
this thesis will not attempt to develop a parser unique to this system.
The parser interface will be simulated so that future research using
this postprocessing system can be modularly adapted to whatever parser
is available.
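
The elimination of impossible words by letter-sequence rules can be
sketched as a simple filter. The rule set below is hypothetical and far
smaller than any practical one: FORBIDDEN_PAIRS lists a few letter pairs
assumed never to occur inside English words, and the actual spelling
rules of the thesis are not reproduced here.

```python
# Hypothetical rule set: letter pairs assumed never to occur in English words.
FORBIDDEN_PAIRS = {"qz", "jq", "zx", "vq", "qk"}

def violates_spelling_rules(word, forbidden=FORBIDDEN_PAIRS):
    """True if any adjacent letter pair of the word is forbidden."""
    word = word.lower()
    pairs = (word[i:i + 2] for i in range(len(word) - 1))
    return any(pair in forbidden for pair in pairs)

def prune(hypotheses):
    """Keep only the word hypotheses that break no spelling rule."""
    return [w for w in hypotheses if not violates_spelling_rules(w)]

survivors = prune(["quick", "qzick", "quiet"])
```

Because the filter tests letter sequences and does not look words up in
a list, it can pass a word it has never seen, which is the property
needed for an unconstrained vocabulary.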

Some use of semantic information will be incorporated into the
contextual postprocessing system; however, complete semantic analysis at
the phrase level, sentence level, or higher is not practical. Existing
approaches for text understanding systems are only capable of operating
within specific subject domains and with very restricted vocabularies.
Semantic influence in the system will be restricted to a process
resembling the short term memory function employed in the speech
recognition system, SPEREXSYS (5).


Assumptions. 


The reading machine process developed by this research project will
not operate free of all constraints. It is assumed that the text to be
used as input is composed of grammatically correct sentences. All words
should be in the English language or, if a word is newly coined, it
should conform to typical English phonemic structure. The number of
different English words which may be used in the input text is not
constrained because the postprocessing system will not depend upon an
a priori listing of vocabulary. The spelling of words should be
reasonably correct. Incorrect spellings may be identified as
unrecognizable words, or in a few cases corrected provided there is
sufficient supporting evidence.

Another important consideration is the accuracy of the OCR
front-end to this system. If the OCR were completely accurate, there
would be no need for this system. However, the recognition accuracy of
the OCR front-end will affect the output accuracy of the combined
reading system. Because the postprocessor is attempting to use knowledge
derived from the structure and content of the text, some portion of the
text must be recognized accurately. Therefore, a recognition threshold
for the OCR will be used to allow some of the characters to be
considered as correctly recognized without postprocessing; treating all
input characters with uncertainty would be computationally prohibitive
for real time processing. The postprocessor will have more information
on which to base its decisions, and thus produce more accurate decisions,
if the accuracy of the OCR allows more characters to be interpreted as
correctly recognized without postprocessing. This postprocessor is
expected to improve the performance of almost any OCR. However, the
postprocessor will only be considering the top few choices presented by
the OCR for each character position. In order to achieve a performance
improvement, it is most important that the correct choice for the
character being read is in the top few choices of the OCR's output. As
the number of characters which can be considered correctly recognized
without postprocessing decreases and the ability of the OCR to provide
the correct letter in the top few choices decreases, the improvement
provided by the postprocessor will eventually decrease to a negligible
amount. This postprocessor is intended for, although not limited to,
those OCRs which are performing just short of the reliability
needed for efficient automation of text processing.
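
The role of the recognition threshold and of the "top few choices" can
be sketched as follows. The threshold of 0.9, the limit of three
choices, and the function split_by_confidence are illustrative
assumptions; the values actually used in the thesis are not given in
this chapter.

```python
def split_by_confidence(ocr_output, threshold=0.9, top_k=3):
    """Accept confident characters outright; keep top_k choices for the rest."""
    result = []
    for choices in ocr_output:
        ranked = sorted(choices, key=lambda pair: pair[1], reverse=True)
        best_letter, best_weight = ranked[0]
        if best_weight >= threshold:
            # Treated as correctly recognized; no postprocessing needed.
            result.append(("accepted", [(best_letter, best_weight)]))
        else:
            # Left uncertain; only the strongest alternatives are retained.
            result.append(("uncertain", ranked[:top_k]))
    return result

# Made-up OCR output for two character positions.
positions = split_by_confidence([
    [("t", 0.97), ("l", 0.03)],
    [("h", 0.50), ("b", 0.30), ("k", 0.15), ("n", 0.05)],
])
```

If the correct letter for the second position were "n", it would be
discarded by the top_k limit, which is why the assumption about the
OCR's top few choices matters.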


General Solution Design


Resolving  Uncertainty  with  Expert  Systems. 


As mentioned earlier, an expert system is a knowledge-based
computer program designed to give advice or solve difficult problems
concerning a specialized subject. Expert systems have been developed
for many applications which include medical diagnosis, analysis of
chemical and geological data, engineering design, and speech
understanding. The basic components of such a system are a database of
information, a set of inference rules to reason with, and a control
strategy for applying the rules to the database and for resolving
conflicts in the reasoning process. The simplest form of these systems
uses well-defined rules of mathematical deduction or predicate calculus
to draw conclusions from consistent statements in the knowledge base.
But there are many applications where the data is not readily
manipulated by the deterministic rules of predicate logic.

Alternate techniques must be used when the data may be considered
unreliable. Situations producing unreliable data include those when the
evidence is questionable or unavailable, those when data descriptions
include relative measures that are not precisely quantifiable, those
when the conclusions that can be proposed are not absolute, or those
when assumptions must be made based upon likeliness or lack of contrary
evidence. Character recognition fits into several of these categories.
The problems of print discontinuity and extraneous markings in the input
text are two obvious examples of questionable evidence being used in the
character recognition process. The continuing efforts to find an
accurate multifont feature set to be used for character recognition
demonstrate that the image data is not precisely quantifiable with
respect to the problem. The use of thresholds by many OCRs for a
recognition/non-recognition decision in the identification process shows
that the processing conclusions are often non-absolute. Some of the
general approaches used by expert systems to cope with uncertainty or
unreliability in the data are nonmonotonic logic, probabilistic
modeling, evidential reasoning, and fuzzy logic (6:175; 7:93-96).

An expert system employing nonmonotonic logic allows hypothetical
statements to be added to and deleted from the database in response to
the admission of new knowledge. This is accomplished through methods
known as default reasoning and dependency-directed backtracking.
Default reasoning makes assumptions about relevant conditions in the
absence of any contradictory information; the most probable choice
(default value) is assumed if no data is provided. These default
assumptions and all further inferences contingent on the default
assumptions are subject to revision as new evidence is supplied to the
system (6:176; 7:73-75). When a new element of information reveals an
inconsistency, assumption and inference statements which were added to
the database must be traced back to the original source of error and
withdrawn. A chronological backtracking of the reasoning process would
be inefficient and wasteful because assumptions and inferences not
dependent on the assumption in error would be erased from memory during
the backtrack and probably regenerated in the near future. By keeping
track of the supporting assumptions and inference rules which justify
each statement added to the database, dependency-directed backtracking
withdraws only the assumptions now believed to be in error and any
inferences derived from those assumptions (6:179; 7:75).
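
The bookkeeping behind dependency-directed backtracking can be sketched
in a few lines. This is a simplified illustration and not a complete
truth maintenance system: the class JustifiedDatabase and its methods
are hypothetical, and each statement records only the set of
assumptions it ultimately rests on.

```python
class JustifiedDatabase:
    """Statements stored with the set of assumptions that justify them."""

    def __init__(self):
        self.support = {}  # statement -> frozenset of supporting assumptions

    def assume(self, statement):
        # An assumption is justified only by itself.
        self.support[statement] = frozenset([statement])

    def infer(self, statement, premises):
        # An inference inherits every assumption its premises depend on.
        deps = frozenset().union(*(self.support[p] for p in premises))
        self.support[statement] = deps

    def retract(self, assumption):
        # Withdraw only statements whose justification includes the assumption.
        self.support = {s: deps for s, deps in self.support.items()
                        if assumption not in deps}

db = JustifiedDatabase()
db.assume("letter1=c")
db.assume("letter3=t")
db.infer("word starts with c", ["letter1=c"])
db.infer("word ends with t", ["letter3=t"])
db.infer("word=cat", ["word starts with c", "word ends with t"])
db.retract("letter1=c")
remaining = sorted(db.support)
```

After the retraction, the statements that rest only on the third letter
are still in the database, whereas a chronological backtrack to the
first assumption would have erased them as well.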

Nonmonotonic reasoning systems are useful in several circumstances.
When the lack of complete information would stop the problem solving
process, default reasoning maintains the momentum. When the problem
solution requires the generation of temporary assumptions and partial
solutions, dependency-directed backtracking provides an efficient means
of revising small components of the total solution. When the
information is provided to the database over an extended period of time
or when the information in the database changes states over time, a
nonmonotonic system is an effective method of maintaining current theory
for a problem solution (6:179). However, the nonmonotonic reasoning
approach does not suit the postprocessing application. Although this
method may be capable of improving the accuracy of character reading
machines, the amount of backtracking computation involved would not be
efficient in automating the reading process. This method of continuous
second guessing does not fit well into the human model of character
recognition.

Fuzzy logic is based upon a specialized mathematical set theory and
is appropriate for interpreting imprecise quantifications of data
(7:95). In fuzzy logic, descriptive modifiers such as large, small, and
many are characterized by specific ranges of values, each associated
with a probabilistic measure (e.g., the value is between 10 and 1,000
with a possibility of 15%). The use of fuzzy logic in expert systems has
not been as widespread as some of the other reasoning methods. This is
largely due to the complexity of mapping the mathematical theory into
the domain of expert interpretations for specialized problems (8).
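
A descriptive modifier defined by ranges of values, each with its own
possibility, can be represented very simply. In the sketch below only
the range from 10 to 1,000 with a possibility of 15% comes from the
example in the text; the other two ranges, the name MANY, and the
function possibility are hypothetical, and the step-shaped definition is
a simplification of a general fuzzy membership function.

```python
# Hypothetical definition of the modifier "many" as (low, high, possibility).
MANY = [
    (0, 10, 0.0),
    (10, 1000, 0.15),
    (1000, float("inf"), 0.85),
]

def possibility(value, modifier):
    """Return the possibility attached to the first range containing value."""
    for low, high, degree in modifier:
        if low <= value < high:
            return degree
    return 0.0

degree_for_500 = possibility(500, MANY)
```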

Probabilistic modeling is the application of statistical data in
quantifying the reliability of input data and the likelihood of the
conclusions. In systems which process high volumes of sensor data,
statistical calculations such as standard deviation and correlation are
used to characterize the data. When data from multiple sources are
combined, joint probability functions are used to calculate the
conditional probability of a concluded event (9). Probabilistic
modeling is a mathematically supported method of dealing rigorously with
data uncertainty. However, there are many drawbacks to its widespread
application. The large amount of data needed to determine the
probability functions is not always available. The large number of
interactions between observations is often so difficult to understand
that conditional independence of sources is assumed without thorough
justification. The constraint that the sum of the probabilities of
possible events must equal one makes it difficult to modify the
knowledge base and rule set. The accuracy of the probabilistic
conclusions is dependent upon the set of hypothetical outcomes being
complete and mathematically disjoint (6:192-193; 7:94). Of these four
major approaches being used in expert systems to reason toward a
solution in the presence of unreliable or inconsistent input data,
probabilistic modeling seems to be the approach used most often in
contextual postprocessing research.


Probabilistic postprocessors use statistical data concerning
individual letter occurrence frequencies or letter sequence (n-gram)
frequencies to estimate the correctness of the text interpreted by the
OCR (10; 11). This approach has had some success, but the success was
dependent upon a close match between the source of statistical data and
the text being processed. Shannon, who was among the first people to
recognize the dependencies between letters in natural languages,
suggested that parameters involved with predicting strings of English
text are dependent upon the particular text involved (12). A variation
on the probabilistic approach which combines subjective estimates of
probabilities with a means of combining positive and negative support
for a proposition is called evidential reasoning.
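
The statistical approach, and its dependence on the sample text, can be
seen in a small bigram sketch. The sample sentence, the function names,
and the scoring rule (a sum of log(1 + count) over the bigrams of a
word) are assumptions made for illustration and are not taken from the
cited postprocessors.

```python
import math
from collections import Counter

def bigram_model(sample_text):
    """Count the letter pairs that occur inside the words of a sample text."""
    counts = Counter()
    for word in sample_text.lower().split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts

def score(word, counts):
    """Sum of log(1 + count) over the word's bigrams; higher is more typical."""
    return sum(math.log1p(counts[word[i:i + 2]])
               for i in range(len(word) - 1))

model = bigram_model("the cat sat on the mat that the rat hated")
better = max(["the", "tbe"], key=lambda w: score(w, model))
```

A model built from this sample gives a score of zero to any word whose
bigrams never appeared in it, however ordinary the word, which is the
bias toward the statistical source that the text describes.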

Evidential reasoning uses a confidence scale ranging from +1 to -1,
distinguishing between supporting and refuting information (13:206; 14).
Conclusions are assigned a certainty value that results from the product
of the probabilistic rating of the information used and the certainty
factor associated with the inference rule used. The strong point of
this methodology comes from a complicated but well-formed set of rules
for combining weighted evidence from multiple sources. These rules are
referred to as the Dempster-Shafer theory (14; 15). Basically, the
combination of two observations results in a measure of belief that is
equal to the measure of belief associated with one of the observations
plus an incremental portion of the belief associated with the second
observation. Thus, several items of information with relatively small
individual probabilistic weightings can combine to increase the
confidence value of a hypothesis (6:194).
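
The incremental combination described above can be written as
a + b(1 - a) for two supporting beliefs a and b between 0 and 1. The
sketch below applies that formula, which is the familiar rule for
combining supporting certainty factors and is also what Dempster's rule
yields for two simple support functions focused on the same hypothesis.
It handles supporting evidence only, and the function names are
hypothetical.

```python
def combine_support(belief_a, belief_b):
    """Belief in a plus the share of b applied to the remaining doubt."""
    return belief_a + belief_b * (1.0 - belief_a)

def combine_all(beliefs):
    """Fold any number of supporting beliefs into one combined value."""
    total = 0.0
    for belief in beliefs:
        total = combine_support(total, belief)
    return total

# Three weak, independent items of support for the same hypothesis.
combined = combine_all([0.3, 0.3, 0.3])
```

Three items of support of 0.3 each combine to 1 - 0.7^3 = 0.657, and the
result does not depend on the order in which the items are combined.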


Many of the more successful expert systems, such as MYCIN and
PROSPECTOR, use some variation of evidential reasoning to
integrate the knowledge contributions of many diverse information
sources. In these systems some of the information sources used to
produce a conclusion are naturally quantifiable, such as sensor data and
the results of diagnostic tests. But other sources of information are
simply the subjective opinions of specialists biased by their
experiences. The key advantage of evidential reasoning is that it is a
formal means of combining knowledge from a wide range of sources. The
model for the human reading process, as introduced in the first chapter,
uses a wide range of knowledge sources in producing a solution. An
expert system designed to imitate that model by combining information
and judgements about text hypotheses at various levels, from the image
data of single characters to the grammar conformity of complete
sentences, will require the flexibility of the evidential reasoning
approach.


Hierarchical  Processing. 

Many  expert  systems  use  several  different  sources  of  information, 
especially  syntactic  and  semantic  knowledge,  to  select  a  confident 
interpretation  of  optical  or  speech  signals.  These  systems  typically 
conform  to  one  of  two  basic  architectures.  One  architecture  is  the 
modular  approach  in  which  discrete  knowledge  sources  are  arranged  in  an 
hierarchical  order  and  are  scheduled  to  process  the  signal  data 
independently. The other architecture is the compiled knowledge
approach in which all the knowledge about the particular problem domain
is  integrated  into  a  large  network  and  processing  is  similar  to 
searching  for  an  optimum  path  through  the  network  (16:332-342). 

The  HEARSAY  speech  understanding  systems  are  the  most  frequently 
used  examples  of  the  modular  approach  to  using  multiple  knowledge 
sources.  The  HEARSAY-I  system  used  three  separate  knowledge  sources:  an 
acoustic/phonetic  source  which  included  information  about  variance 
between  speakers  and  about  environmental  noise,  a  syntax  knowledge 
source,  and  a  semantic  knowledge  source  (16:343-344).  Although  the 
knowledge sources were very domain dependent, each acted as an independent
processor and could be substituted for or modified without affecting
the  operation  of  the  other  modules.  This  is  the  key  advantage  of  the 
modular  approach.  The  ability  to  add,  delete,  modify,  and  replace 
knowledge  sources  within  the  expert  system  is  extremely  useful  in  the 
research  environment.  The  HEARSAY-II  system  differed  from  its 
predecessor  in  a  few  ways.  HEARSAY-II  operated  in  a  more  complex 
problem  domain  and  used  more  knowledge  sources  that  had  smaller  areas  of 
specialization.  But  the  key  difference  was  in  the  overall  control  of 
processing.  The  HEARSAY-II  system  made  more  efficient  use  of  its 
computational  resources  by  restricting  the  processing  of  hypotheses  by 
the various knowledge sources to a limited number of best choices. This
was  primarily  accomplished  by  ordering  the  lower  level  knowledge  sources 
into  an  hierarchy  where  processing  at  one  level  must  be  complete  before 
processing  at  the  next  higher  level  is  started  (16:343-348). 

In  contrast  to  the  HEARSAY  systems,  the  HARPY  speech  understanding 
system used the compiled network approach. The HARPY system was more
efficient and accurate than the HEARSAY systems when compared in the
same problem domains (16:351). The improvement came from two basic
concepts.  The  complete  set  of  possible  solutions  is  assembled  in  the 
system  network.  Therefore,  the  set  of  competing  solution  hypotheses  is 
smaller.  This  system  had  better  accuracy  because  it  had  fewer  incorrect 
solutions  competing  with  the  true  solution.  In  the  modular  system,  it 
is  possible  that  the  true  solution  is  never  assembled  into  an 
hypothesis.  The  other  advantage  of  the  network  system  is  speed. 
Efficient  search  heuristics  which  restrict  backtracking  can  be  used  on 
the precompiled network to produce a solution more quickly than if the
search path needed to be generated as the search progressed.

But  the  compiled  network  approach  has  inherent  disadvantages  that 
are  important  considerations  in  this  research  project.  The  obvious 
disadvantage  is  that  whenever  a  knowledge  source  is  changed,  the  entire 
network  must  be  reconstructed  (16:337).  This  is  a  time  consuming  and 
complicated  task  because  the  changes  must  be  incorporated  explicitly  at 
each affected branch of the network. The other important disadvantage
is that in order to explicitly define the network, the problem domain
must  be  quite  constrained  (16:337).  This  requirement  may  be  acceptable 
when  working  in  domains  with  limited  lexicons  and  syntax  such  as  the 
postal  address  reader,  but  it  is  not  acceptable  when  the  problem  domain 
is  a  general  text  reader  where  the  full  range  of  sentence  syntax  and  an 
unrestricted vocabulary comprise the possible input. In order to
concentrate  on  the  representation  and  cooperation  of  diverse  knowledge 
sources  in  this  research  system,  the  modular  architecture  with 


Lower Level Knowledge Sources.

The  knowledge  sources/experts  described  in  this  thesis  as  lower 
level sources are those which perform postprocessing of text constructs
that are whole words or smaller. The most common approaches used in
research of postprocessing systems operating at this level are the use
of statistical estimation concerning possible letter sequences, and the
selection or confirmation of hypothesized words based on an enumerated
lexicon. Both methods have been successful in improving character
recognition  for  texts  within  limited  domains. 

The  statistical  estimation  approach  recognizes  that  there  are 
dependencies  among  characters  as  they  appear  in  words  in  a  natural 
language.  The  premise  behind  this  approach  is  that  natural  languages 
can  be  approximated  by  a  Markov  source  (10;  12:536).  Letter  transition 
frequencies,  gathered  from  sample  texts,  are  used  to  determine  the  most 
probable  letter  choice  in  accordance  with  the  frequencies  that  various 
letters  follow  the  previous  letters.  In  some  systems  only  one  previous 
letter is considered; these systems assume a second-order Markov process
for their approximations of digram probabilities. Improved error rates
are achieved by using statistical data about trigrams, three-letter
combinations, to choose the most probable letter. The third-order Markov
approximation has a high resemblance to ordinary English text (10).
Using fourth-order or higher order approximations requires significant
increases in computational and data storage resources which are
difficult to justify by the anticipated performance improvements. A
third-order system has 26^3 or 17,576 trigram statistics to store and
process while a fourth-order system would require 26^4 or 456,976
statistics about 4-grams.
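
The storage figures and the digram selection step can be sketched as
follows. This is a minimal illustration under assumed conditions: a
26-letter alphabet, raw transition counts gathered from a small sample
string, and hypothetical function names. It is not the procedure of any
particular system cited above.

```python
from collections import Counter

ALPHABET_SIZE = 26  # assumption: lowercase English letters only


def ngram_table_size(n: int) -> int:
    """Number of distinct n-grams that a full statistics table must hold."""
    return ALPHABET_SIZE ** n


def digram_counts(sample_text: str) -> Counter:
    """Count letter transitions inside the words of a sample text."""
    counts = Counter()
    for word in sample_text.lower().split():
        letters = [c for c in word if c.isalpha()]
        for prev, cur in zip(letters, letters[1:]):
            counts[prev + cur] += 1
    return counts


def most_probable_letter(prev: str, candidates: str, counts: Counter) -> str:
    """Choose the candidate that most often follows prev in the sample."""
    return max(candidates, key=lambda c: counts[prev + c])
```

For example, with counts gathered from "the then that this", the
candidate "h" is preferred over "b" after the letter "t". A candidate
whose transition never appeared in the sample text receives a count of
zero, which is how the statistics become biased by their source.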

The  statistical  approach  has  a  key  advantage  over  the  lexical 
approach.  By  strictly  relying  upon  letter  transitional  properties  at 
the  n-gram  level,  the  system  is  not  constrained  to  an  enumerated 
vocabulary.  The  vocabulary  is  constrained  by  the  source  of  the  n-gram 
statistics.  Some  legitimate  words  may  not  be  accepted  by  these  systems 
because  an  unusual  letter  combination  may  not  have  been  present  in  the 
source  texts  used  to  generate  the  letter  transition  probabilities.  The 
success of the statistical estimation approach is dependent upon the
source of the estimation statistics. When the source of the estimation
statistics matches the contextual properties of the material being
recognized, the performance is better than if the estimation statistics
were derived from a general source and applied toward recognition of
text with unusual characteristics (10:550). For example, the
statistics  derived  from  typical  narrative  text  would  be  different  from 
the  statistics  derived  from  a  list  of  names;  therefore,  the  letter 
transition  predictions  made  with  both  sets  of  source  statistics  would  be 
different.  One  statistical  set  will  have  a  higher  percentage  of  correct 
predictions.  Obviously,  the  set  which  most  closely  resembles  the  text 
being  processed  will  perform  better.  An  interesting  side  effect  of  the 
statistical  approach  is  that  in  some  cases  this  process  has  been  able  to 
correct spelling errors that had been present in the original text being
recognized (11).

The lexical approach to contextual postprocessing is principally a
dictionary lookup method. In simple lexical systems the word hypothesis
is searched for in the dictionary until a match is found. If no match
is found the word is a reject error, flagged as non-recognizable by the
system. More sophisticated systems will not immediately reject a word
that is not found in the dictionary. Instead, the system may change
some of the letters in the word to find a match. This type of system
will have a much smaller reject error rate, but the number of
substitution errors will increase. Many of the letter substitution
lexical systems use digram probabilities to choose the substituted
letters. However, this use of letter transition statistics does not
expand the system vocabulary; the entire lexicon accepted by the system
must still be contained in its dictionary.
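
The difference between a simple lookup and a letter substitution lookup
can be sketched as below. The sketch assumes that the OCR supplies its
candidate letters for each position, best first, and that substitutions
are tried one position at a time using those alternatives. The systems
described above often choose substitutes by digram probability instead,
so this is only one hypothetical variant.

```python
def lexical_postprocess(candidates_per_position, dictionary):
    """Return (word, status) for one word hypothesis.

    candidates_per_position: one string per letter position holding the
    candidate letters for that position, most likely first (assumed
    input format).
    dictionary: a set containing every word the system accepts.
    """
    top_choice = "".join(c[0] for c in candidates_per_position)
    if top_choice in dictionary:
        return top_choice, "accepted"

    # Letter substitution: replace one position at a time with an
    # alternative candidate and look the result up again.
    for i, alternatives in enumerate(candidates_per_position):
        for letter in alternatives[1:]:
            trial = top_choice[:i] + letter + top_choice[i + 1:]
            if trial in dictionary:
                return trial, "substituted"

    # No dictionary entry could be reached: a reject error.
    return top_choice, "rejected"
```

A substituted result lowers the reject rate but may be a substitution
error, because the first dictionary word reached is not necessarily the
word that was printed. A word outside the dictionary can never be
returned as accepted, which is the vocabulary constraint noted above.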

In limited problem domains, where the entire set of possible words
can be represented in the dictionary, lexical postprocessing is very
effective. A very large vocabulary is required to apply the lexical
approach to the general text reading problem. As the dictionary size
grows, the amount of storage required and the amount of computation
required increase. Also, it is more likely that incorrect words will
be substituted for words not in the dictionary because the dictionary
contains more words closely resembling the actual word. It is
impractical to have the dictionary contain all possible words;
therefore, any lexical system is subject to having a high reject error
rate for some texts not typified by its vocabulary (e.g., text containing
specialized medical terminology). High non-recognition rates are not
desirable when one of the key interests of the system design is to
increase automation. Reject errors require a lot of operator assistance
during the recognition process or a lot of proofreading after the
recognition process is complete.

Although there are lexical based knowledge sources used in the
expert system designed for this research, those knowledge sources are
not the primary processors in the system. The key low level processor
in this system was designed to remove the constraints of a limited
vocabulary without containing the bias of statistical data. This
processor is similar to those used in the statistical approach systems
because it evaluates words at the n-gram level and it is not restricted
to an enumerated vocabulary. But the important difference from the
statistical processors is that this knowledge source is not biased by
the source of statistical data, because this sub-system only provides a
determination of letter string legitimacy and leaves the ranking of
choices to the sensor data information and other knowledge sources. The
key low level processor uses variable length n-grams and word position
dependent combination rules to determine which hypothesized words
conform to normal English spelling patterns.
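
The distinction between judging legitimacy and ranking choices can be
sketched as follows. The n-gram tables and position rules of the actual
processor are not given in this section, so the three small tables below
are placeholders invented for illustration, and they make no claim about
English spelling in general.

```python
# Placeholder tables (assumptions): letter groups of varying length that
# are treated as impermissible at a given word position.
START_ILLEGAL = {"ck", "ng", "tl"}
END_ILLEGAL = {"q", "bl", "tr"}
ANYWHERE_ILLEGAL = {"qz", "jx", "vq"}


def is_legitimate(word: str) -> bool:
    """Yes/no judgement of spelling legitimacy; no score is produced."""
    w = word.lower()
    if any(w.startswith(g) for g in START_ILLEGAL):
        return False
    if any(w.endswith(g) for g in END_ILLEGAL):
        return False
    return not any(g in w for g in ANYWHERE_ILLEGAL)


def filter_hypotheses(word_hypotheses):
    """Drop illegitimate hypotheses, keeping the incoming order so that
    the ranking remains the job of other knowledge sources."""
    return [w for w in word_hypotheses if is_legitimate(w)]
```

Because the filter only removes hypotheses and never reorders them, no
frequency data from any sample text enters the decision.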

There  are  several  other  knowledge  sources  operating  at  the  lower 
level.  Some  of  these  are  used  to  rank  the  word  and  letter  choices, 
while others pass judgement on the permissibility of word hypotheses
using rules different from those used by the key processor. For
instance, one processor knows rules about the positioning of apostrophes
within words; those words which do not conform are eliminated from
further processing. Not every possible low level knowledge source was
implemented; only a few simple ones were used to demonstrate the types
of information which could be brought to bear on the problem.
Additional knowledge sources could be easily added to the modular design
if this methodology is used again in future research.


Higher Level Knowledge Sources.

The knowledge sources/experts described in this thesis as higher
level  sources  are  those  which  perform  postprocessing  of  text  constructs 
that  are  larger  than  individual  words.  Although  it  is  possible  to  have 
these  knowledge  sources  operate  with  phrases  and  partial  sentences,  it 
is  easier  to  maintain  a  black  box  approach  to  employing  higher  level 
experts  if  this  level  of  postprocessing  focuses  on  complete  sentences. 
The general purpose of including the higher level knowledge sources is
to  add  constraints  to  the  possible  combinations  of  words  that  can  be 
hypothesized  by  the  lower  level  experts.  Two  areas  of  language 
knowledge  that  can  be  applied  to  sentence  level  hypotheses  are  syntax 
and  semantics. 

Syntactic  analysis  examines  the  word  arrangement  within  a  sentence 
to  determine  if  and  how  the  sentence  conforms  to  the  rules  of  grammar. 
This  analytic  process  is  called  parsing.  Parsing  determines  the 
relationships  of  each  word  within  a  sentence  to  the  other  words  in  the 
sentence.  Sequences  of  words  which  violate  the  rules  of  grammar  are 
rejected  by  a  parser.  There  are  many  different  approaches  to  parsing 
English  sentences.  These  different  approaches  to  parsing  represent 
attempts to improve efficiency and to produce better interpretations for
sentences which conform to more than one grammatical structure. Many of
the different parsing schemes are analogous to the differences in search
techniques; some parsers use extensive backtracking, some parsers
search for all syntactically correct solutions, and some parsers use
heuristic directed search to find only one syntactic solution. There
are also some parsers which rely on semantic information to find the
correct  interpretation  of  a  sequence  of  words.  Parsers  using  semantic 
grammars  usually  examine  only  the  keywords  of  a  sentence;  therefore, 
they  would  not  be  appropriate  for  resolving  ambiguities  which  can  occur 
with  any  word  in  a  sentence. 

Semantic  analysis  is  very  complicated  and  is  not  easy  to  define. 
One  simple  definition  of  semantic  analysis  is  determining  the  mapping 
between  the  syntactic  structure  of  a  sentence  and  some  appropriate 
relationship  between  the  objects  represented  by  the  words  in  the 
sentence. Very simplistically stated, semantic analysis interprets what
the sentence means and whether that meaning makes sense.

Currently,  the  basic  approach  to  semantic  analysis  is  to  match  the 
sentence  keywords  to  either  a  frame  or  script  knowledge  representation 
and  then  assign  other  words  in  the  sentence  to  the  various  slots  in  the 
frame  or  script.  The  frame  and  script  systems  have  expectancies  about 
which  words  can  be  associated  with  the  keywords  identifying  the  frame  or 
script.  For  example,  if  the  keyword  for  a  particular  frame  was  the  verb 
"drink,"  then  the  subject  of  the  sentence  would  be  expected  to  be  a 
person  or  animal,  and  the  object  of  the  sentence  would  be  expected  to  be 
a  liquid.  Scripts  form  similar  expectancies,  but  the  scope  of  those 
expectancies  may  extend  over  several  sentences  or  the  entire  text. 
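
The "drink" example can be written as a small frame with two slots. The
word classes and the slot names below are assumptions made for
illustration; a working semantic analyzer would need a far larger
knowledge base.

```python
from dataclasses import dataclass

# Hypothetical vocabulary: every word must be classified in advance.
WORD_CLASSES = {
    "dog": "animal",
    "girl": "person",
    "water": "liquid",
    "milk": "liquid",
    "rock": "solid",
}


@dataclass(frozen=True)
class Frame:
    keyword: str                # verb that identifies the frame
    subject_classes: frozenset  # classes expected in the subject slot
    object_classes: frozenset   # classes expected in the object slot


DRINK = Frame("drink", frozenset({"person", "animal"}), frozenset({"liquid"}))


def fits_frame(frame: Frame, subject: str, obj: str,
               word_classes=WORD_CLASSES) -> bool:
    """True when both slot fillers belong to the expected classes.
    A word missing from the vocabulary cannot satisfy any expectancy."""
    return (word_classes.get(subject) in frame.subject_classes
            and word_classes.get(obj) in frame.object_classes)
```

Here "dog" and "water" fit the frame, while "girl" and "rock" do not.
Any word absent from the class table fails automatically, which shows
why this style of analysis depends on a fixed vocabulary.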


The  basic  properties  of  parsers  and  semantic  analyzers  suggest  that 
they  are  both  operationally  constrained  to  limited  domains.  The 
analysis  of  syntax  presupposes  knowledge  of  the  parts  of  speech  which 
each  word  can  fulfill.  Therefore,  parsers  are  constrained  to  operation 
within  a  fixed  vocabulary.  Semantic  analyzers  are  also  constrained  to 
operation  within  a  fixed  vocabulary.  Semantic  analyzers  require 
knowledge  of  how  each  word  can  be  used  as  parts  of  speech,  how  each  word 
belongs  to  classes  of  objects  and  actions,  and  how  each  word  forms 
expectancies  toward  other  classes  of  actions  and  objects.  The 
complexity  and  size  of  the  knowledge  base  required  for  semantic  analysis 
increase nonlinearly as the subject domain (vocabulary) grows. It
would  not  be  feasible  to  use  the  current  methods  of  semantic  analysis  in 
the  postprocessing  expert  system  for  reading  generic  text. 

However,  the  use  of  syntactic  analysis  may  be  adaptable  to  the 
generic  text  postprocessor.  If  only  one  or  two  words  in  an  occasional 
sentence  are  not  contained  in  the  vocabulary  for  the  parser,  then  it  may 
be  possible  to  hypothesize  the  part  of  speech  which  the  word  must 
fulfill  and  store  this  hypothesis  for  use  or  confirmation  while 
processing  the  remaining  sentences  in  the  text.  If  words  not  in  the 
parser  vocabulary  are  used  more  than  once  in  the  text,  then  this  method 
could increase the benefits of parsing while postprocessing texts that
are not restricted to predetermined vocabularies. There are many
circumstances which increase the likelihood that an unusual word will be
repeated in a text. The word which is unfamiliar to the parser
vocabulary may be a word of preference in the author's vocabulary.
Also, if the unfamiliar word is related to the specific topic of the
text, then the word is likely to appear again. Similarly, words are
repeated in written text to form transitions from the thoughts
expressed in one sentence to those expressed in following sentences.


The  requirement  that  the  parser  be  able  to  hypothesize  the  part  of 
speech  for  an  unknown  word  makes  it  difficult  to  use  existing  parser 
systems  for  this  postprocessor.  However,  since  the  parser  information 
will  not  be  used  in  semantic  processing,  the  parser  does  not  need  to 
determine  the  optimal  grammatical  interpretation  of  the  sentence.  The 
parser  is  only  required  to  determine  if  a  correct  grammatical 
interpretation  exists.  This  may  help  in  the  modification  of  existing 
parsers  to  fit  the  needs  of  this  system.  Parser  operation  was  simulated 
in  this  thesis  because  there  was  not  sufficient  time  in  the  thesis 
schedule  to  adapt  or  develop  a  parser  which  would  provide  the  operations 
described  above. 

Noticing  the  phenomenon  that  words  tend  to  recur  throughout  a  body 
of  text  suggested  that  there  is  an  important  source  of  information  which 
has  not  been  used  by  any  of  the  lower  level  or  higher  level  experts 
mentioned  so  far.  If  the  words  used  in  each  of  the  accepted  sentence 
hypotheses were remembered, these words could beneficially influence the
selection of word hypotheses at the lower processing level. This is
somewhat similar to the implementation of a short term memory as used in
the  speech  recognition  system,  SPEREXSYS  (5).  This  knowledge  source 
should  produce  the  effect  of  having  a  domain  specific  lexicon  from  which 
word  hypotheses  can  be  selected  with  increased  confidence. 

The benefits of a short term memory knowledge source would not be
as significant at the beginning of text being processed as in the middle
or end of that text. It may be possible to offset the learning curve
of this knowledge source. Some words have a tendency for recurring in
many types of texts while other words have a tendency for recurring in
texts which are subject domain specific. The short term memory does not
need to start off blank at the beginning of each new text. Instead,
this knowledge source could contain a short list of words which have a
tendency to recur in a wide variety of texts, such as the most likely
words compiled from the Brown Corpus (4). This way the benefits of this
knowledge source could be more consistent for all parts of the text
being processed.
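
A short term memory of this kind can be sketched as a set of remembered
words that adds a small amount of support to matching hypotheses. The
class name, the size of the bonus, and the way the bonus is combined
with the existing confidence are all assumptions made for illustration.

```python
class ShortTermMemory:
    """Remembers words from accepted sentences and favors their reuse."""

    def __init__(self, seed_words=(), bonus: float = 0.1):
        # Seeding with common words avoids starting each text blank.
        self._words = {w.lower() for w in seed_words}
        self.bonus = bonus  # assumed weight of the recurrence evidence

    def remember_sentence(self, sentence_words) -> None:
        """Store the words of an accepted sentence hypothesis."""
        self._words.update(w.lower() for w in sentence_words)

    def adjust(self, word: str, confidence: float) -> float:
        """Raise the confidence of a hypothesis that was seen before.
        The bonus is applied to the remaining doubt, so the result
        never exceeds 1."""
        if word.lower() in self._words:
            return confidence + self.bonus * (1.0 - confidence)
        return confidence
```

A memory seeded with "the" and "of" helps from the first sentence
onward, while a topic word such as "parser" only gains support after a
sentence containing it has been accepted.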


Implementation Programming Language.


Several  programming  languages  were  considered  for  implementation  of 
the  expert  system  proposed  in  this  thesis.  At  first,  Fortran  was 
considered  because  reading  machine  research  previously  conducted  at  AFIT 
was accomplished in Fortran. Fortran would be a good selection if this
system were to interface with the existing programs for character
recognition. But the design of this system was intended to be
independent of the OCR front end. The OCR front end could be simulated
and  Fortran  was  not  necessary  to  simulate  the  OCR  output. 

Lisp and the available expert system building tools, such as KEL,
were considered. These programming environments are well suited for
implementing expert systems. However, a Lisp based program would
probably not be easily transported to a system supporting the OCR
front-end processing. Alternatively, the OCR program could be moved to the
Lisp environment, but the vast amount of mathematical computation used in
those programs would not adapt well. Also, the expert system building
tools were not used because I did not feel I had enough time to learn
the tools during my thesis schedule, and because an attractive
alternative was available.

Pascal was chosen for the implementation language for several
reasons. Pascal is well suited to modular design. The program code
written in Pascal is very similar to the pseudo code used to design
algorithms. The resemblance to pseudo code makes it easier to use the
ideas explored by this thesis in future research in almost any other
programming environment and to transport this design to whatever system
is driving the OCR front-end. Another important factor is that the
mathematical processing used to weight and combine evidence seemed
easier to accomplish in Pascal than in a Lisp environment. And finally,
I was already familiar with programming in Pascal. The postprocessor
prototype was implemented on a VAX computer using Berkeley Pascal.
Portions of the prototype were developed using Turbo Pascal on a
minicomputer.


Controlling  Modules. 

The  control  modules  for  the  postprocessing  expert  system  are 
arranged  in  an  hierarchical  fashion.  This  arrangement  provides  a 
flexible  framework  for  experimenting  with  the  textual  constraints 
produced by the individual and cooperative efforts of diverse knowledge
sources. However, in a few instances the control elements are not
isolated  from  the  knowledge  sources.  This  occurs  at  the  lower  level 
where  some  knowledge  sources  are  embedded  with  control  statements  for 
other  knowledge  sources.  Control  statements  were  grouped  into  modules 
as  much  as  possible,  but  there  were  occasions  when  the  optimum  time  to 
activate  the  processing  of  a  supplementary  knowledge  source  was  during 
the  processing  of  the  primary  knowledge  source. 

The expert system contains five hierarchically arranged control
modules, including the main program module. Each of these modules
is responsible for using utilities to manipulate text structures for the
next phase of processing, triggering the knowledge subsystem processors
that are within its scope of control, and passing control at appropriate
times to the control block which is the next lowest in the control
hierarchy. Control returns upward through the hierarchy only after all
actions associated with a particular text construct are completed. As
control returns to the higher modules, flags are also passed to indicate
the status of the portion of text being processed (e.g. word completed,
sentence completed). All of the control modules reflect the modular
approach of the expert system by containing a relatively small amount of
code with distinct calls to individual utilities and expert subsystems.

The main program module is responsible for starting and stopping
the postprocessor. It also makes procedural calls to an initialization
utility to load the data bases for the various knowledge sources, and to
output utilities to store completed sentences and to print a copy of the
text which resulted from the completed postprocessing effort. The main
module operates the postprocessor by repeatedly requesting the next
control module, MakeSentence, to provide processed sentences.
Processing is stopped when a control flag indicates that there is no
more text to be processed.

The MakeSentence module is the primary controller for the higher
level knowledge sources. It passes control down the hierarchy by
requesting the module, GroupWords, to provide the hypotheses for the
next sentence to be processed. Once the hypotheses are received from
GroupWords, the MakeSentence module processes them with the higher level
knowledge sources. Since unrestricted domain semantic analysis is not
feasible within current technology, only syntax is used to determine
which of the sentence hypotheses are likely to be correct. Selection of
the best sentence hypothesis is done by a utility under the control of
the MakeSentence module. After selecting one hypothesis, MakeSentence
invokes another knowledge source to extract new words from the sentence
so that additional weighting can be given to future hypotheses which
reflect the repetition of words used in previous sentences.


GroupWords is considered one of the control modules for the
higher level text processing. Although it does not invoke any knowledge
source, it is responsible for the interface between the lower level and
higher level knowledge sources. GroupWords forms the output from the
lower level processing into sentence hypotheses which can be processed
by the higher level knowledge sources. If the higher level knowledge
sources were to change substantially, the interface which rearranges the
data into the expected format could be easily inserted at this control
point. To receive the data from the lower level processors, GroupWords
passes requests and control to the next control module in the hierarchy.
That  module  is  called  GetWords. 

The  GetWords  module  is  similar  to  the  GroupWords  module  because  it 
also  does  not  directly  invoke  any  knowledge  source  which  analyzes  text 
hypotheses. GetWords is responsible for formatting the OCR input prior
to  processing  by  the  lower  level  knowledge  sources.  It  uses  a  knowledge 
source  embedded  in  a  utility  to  limit  the  search  depth  of  letter 
hypotheses.  That  knowledge  source  uses  thresholds  which  are  dependent 
upon  the  decision  metrics  of  the  particular  OCR  front-end  being  used. 
By  isolating  the  front-end  dependent  processing  from  the  rest  of  the 
expert  system,  it  is  easier  to  adapt  this  expert  system  to  different 
optical character recognition front-ends. When enough OCR input is
gathered, the data and control are passed to the last control module,
WordExpert,  for  analysis  by  the  lower  level  knowledge  sources. 
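
The way control passes down the hierarchy and returns with status flags
can be sketched as follows. The prototype was written in Pascal; this
Python sketch keeps only the calling structure. The knowledge sources
are replaced by stand-ins, and the input format, the flag names, and the
accept function are assumptions made for illustration.

```python
import itertools

END_OF_SENTENCE = "."  # assumed marker in the simulated OCR stream


def word_expert(letter_candidates):
    """Lowest module: expand letter candidates into word hypotheses.
    (Stand-in for the lower level knowledge sources.)"""
    return ["".join(p) for p in itertools.product(*letter_candidates)]


def get_words(ocr_stream):
    """Take the next OCR item and pass word data down to word_expert."""
    token = next(ocr_stream, None)
    if token is None:
        return [], "no_more_text"
    if token == END_OF_SENTENCE:
        return [], "sentence_completed"
    return word_expert(token), "word_completed"


def group_words(ocr_stream):
    """Collect word hypotheses and form them into sentence hypotheses."""
    per_word = []
    while True:
        hypotheses, flag = get_words(ocr_stream)
        if flag == "word_completed":
            per_word.append(hypotheses)
            continue
        sentences = ([" ".join(c) for c in itertools.product(*per_word)]
                     if per_word else [])
        return sentences, flag


def make_sentence(ocr_stream, accept):
    """Apply the higher level judgement (here, accept) and pick one."""
    sentences, flag = group_words(ocr_stream)
    acceptable = [s for s in sentences if accept(s)]
    chosen = acceptable[0] if acceptable else (sentences[0] if sentences else None)
    return chosen, flag


def main(ocr_tokens, accept=lambda sentence: True):
    """Request sentences until the flag reports that no text remains."""
    stream = iter(ocr_tokens)
    text = []
    while True:
        sentence, flag = make_sentence(stream, accept)
        if sentence is not None:
            text.append(sentence)
        if flag == "no_more_text":
            return text
```

Each function calls only the module directly beneath it and returns a
flag along with its data, so a replacement for any one level leaves the
others untouched.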

The  WordExpert  module  is  primarily  responsible  for  appropriately 
invoking  three  knowledge  source  modules.  The  first  of  these  modules  is 
the spelling expert subsystem which includes many separate knowledge

Figure 1. Control Module Hierarchy


Thresholding and Word Hypothesization.


The interface requirement imposed on the OCR front-end is that it
be  capable  of  supplying  the  postprocessor  with  a  weighted  list  of 
possible  characters  for  each  character  being  recognized.  If  each  letter 
position  in  a  five  letter  word  were  assigned  five  possible  characters, 
there would be up to 5^5 or 3,125 different word hypotheses. If the same were true
for each word in a simple five word sentence, there would be up to
3,125^5, or roughly 3 x 10^17, sentence hypotheses which would require evaluation by the
postprocessor. It is necessary to reduce the search space
prior to any complex evaluation by the postprocessor. In order to
avoid the combinatorial explosion of processing that would occur if
each letter position in a word were considered uncertain, a threshold is
used  to  determine  which  letters  can  be  assumed  to  have  been  accurately 
recognized  by  the  OCR  without  postprocessing. 
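
The effect of the threshold on the search space can be shown with a
short calculation. The sketch assumes that each letter position arrives
as a list of (letter, weight) pairs, best first, and that a larger
weight means a closer match. Real OCR decision metrics may run the other
way, and the threshold value used in the test is arbitrary.

```python
from math import prod


def apply_threshold(weighted_candidates, threshold):
    """Keep only the top letter when its weight meets the threshold;
    otherwise keep every candidate for postprocessing."""
    best_letter, best_weight = weighted_candidates[0]
    if best_weight >= threshold:
        return [best_letter]
    return [letter for letter, _ in weighted_candidates]


def count_word_hypotheses(positions, threshold=None):
    """Number of word hypotheses: the product of the number of
    candidates kept at each letter position."""
    if threshold is None:
        return prod(len(p) for p in positions)
    return prod(len(apply_threshold(p, threshold)) for p in positions)
```

Five candidates at each of five positions give 5^5 = 3,125 word
hypotheses. If three of the five positions pass the threshold, only the
two uncertain positions remain open and the count falls to 25.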

Many  OCRs  use  thresholds  to  define  the  tolerance  limit  for 
accepting  a  match  between  the  input  data  and  the  stored  character 
prototypes.  The  thresholds  may  be  established  on  an  individual 
character  basis  or  they  can  be  uniform  for  the  entire  character  set. 
Threshold assignments are dependent upon information particular to the
measuring scheme employed by each OCR. Input data which does not match
a character prototype within the tolerance specified by the thresholds
is usually flagged by the OCR as an unrecognizable character. The
alternative to using thresholds in an OCR is to use forced recognition.
Forced recognition is a method that requires the OCR to select its best
guess for each character being recognized. There is no distinction,
under a forced recognition algorithm, between characters which have
accurate prototype matches and those which have inaccurate matches. The
advantage  of  forced  recognition  is  that  it  does  not  require  as  much 
operator  assistance  as  a  thresholding  method.  The  trade  off  is  that  the 
forced  recognition  algorithm  has  a  much  higher  rate  of  incorrect 
recognitions. 

The  higher  automation  efficiency  of  forced  recognition  OCRs  could 
be  achieved  by  thresholding  OCRs  without  sacrificing  accuracy  if  the 
characters  flagged  as  unrecognizable  by  an  OCR  using  thresholds  were 
postprocessed  using  contextual  information  not  included  in  the  design  of 
the  OCR's  matching  prototypes.  All  characters  recognized  from  data 
falling  within  the  thresholds  would  have  the  same  error  rate  using 
either  the  thresholding  algorithm  or  the  forced  recognition  algorithm. 
The improvement achieved by postprocessing the reject characters from
the thresholding algorithm would come from overriding the parallel
forced recognition decision when evidence derived from the natural
constraints of the English language supported a different choice.

This  general  approach  has  been  shown  successful  in  spoken  word 
recognition  systems  operating  in  a  limited  domain.  For  instance,  a  127 
word  vocabulary  system  designed  to  interpret  requests  for  flight 
information  and  reservations  showed  an  error  rate  improvement  from  the 
forced  recognition  error  rate  of  10.8%  to  an  error  rate  of  0.4%  by  using 
thresholds  and  syntactic/semantic  postprocessing  (17:  1619-1623;  18). 
The address reading machines used for postal letter sorting also
demonstrate the effectiveness of postprocessing within a limited domain
(19: 1032, 1041). The expert system designed in this thesis uses
postprocessing  techniques  which  attempt  to  avoid  the  restrictions  of  any 
particular  subject  domain.  The  expert  system  will  use  thresholds  to 
place  a  feasible  bound  on  the  hypothesis  search  space  to  be 
postprocessed. 

The  thresholds  should  be  selected  by  the  analysis  of  historical 
data about the error rates for the particular OCR front-end that will be
used.  The  thresholds  should  be  set  at  a  value  which  will  obtain  a 
recognition  error  rate,  for  the  non-rejected  characters,  which  is 
slightly  better  than  the  desired  overall  error  rate.  The  characters 
rejected  by  the  threshold  will  be  postprocessed  to  improve  the  accuracy 
of  the  alternative  forced  recognition  decision  for  those  characters. 
Since  the  performance  of  the  postprocessing  is  dependent  upon  the 
accuracy  of  the  assumed  known  characters,  the  threshold  must  be  set 
smaller  than  the  threshold  required  to  achieve  the  desired  error  rate 
among  the  non-rejected  characters.  The  thresholds  used  for  testing  the 
expert  system  in  this  thesis  were  picked  using  a  prototype  distance 
matrix  for  a  low  frequency  filtered  2D-FFT  of  a  simple  character  set. 
The  distance  matrix  and  character  set  are  found  in  Appendices  B  and  C. 
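
One way to carry out such a selection from historical data is sketched below. This is an illustration under stated assumptions, not the procedure used in the thesis: the sample format (a prototype distance paired with a flag saying whether the forced guess was correct) and the function name are hypothetical.

```python
def pick_threshold(samples, target_error_rate):
    """Return the largest distance threshold whose accepted characters
    keep an error rate at or below target_error_rate, or None.

    samples: (distance, was_correct) pairs gathered from past OCR runs.
    """
    ordered = sorted(samples)
    best = None
    errors = 0
    for accepted, (distance, was_correct) in enumerate(ordered, start=1):
        if not was_correct:
            errors += 1
        # When several samples share a distance, evaluate only after the last.
        if accepted < len(ordered) and ordered[accepted][0] == distance:
            continue
        if errors / accepted <= target_error_rate:
            best = distance
    return best
```

Following the text, the caller would pass a target slightly stricter than the desired overall error rate, since the accepted characters are treated as known during postprocessing.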


Spelling Expert Subsystem.


The  spelling  expert  subsystem  is  the  most  essential  component  of 
this  postprocessing  system.  The  spelling  expert  is  responsible  for 
making a substantial reduction in the word hypothesis search space by
eliminating those hypotheses containing non-conforming letter sequences.
The  spelling  expert  subsystem  is  also  embedded  with  control  statements 
to  activate  supplementary  knowledge  sources  which  process  information  at 
the  character  sequence  and  word  levels;  these  knowledge  sources  will  be 
discussed  in  the  next  section  of  this  chapter.  The  spelling  expert  is 
an  original  approach  to  representing  the  English  language  spelling 
constraints  which  are  difficult  to  express  in  an  explicit  series  of 
rules. 

Spelling is a complex operation to master, even for human beings.
People improve their spelling proficiency through their experience
in using the language. The rules which may have been taught to us in
school are few in number and are accompanied by exceptions. Even this
very familiar spelling rule, taught to us in grammar school and included
in texts on spelling (20:18), is accompanied by exceptions:

     Put "i" before "e"
     Except after "c"
     Or when sounded like "a"
     As in "neighbor" or "weigh".

Exceptions to this rule include:

     ancient   height   forfeit   efficient
     neither   weird    leisure   seize

Besides  the  problems  of  being  an  incomplete  set  and  being 
applicable  to  only  the  most  general  cases,  spelling  rules  usually 
reference the pronunciation of the word or word root when deciding
between spelling options. The circumstances where word sound knowledge
is required to use the spelling rules include knowledge of accented
syllables, knowledge of the silent e, and 'sounds-like ...' knowledge
(20: 15-28). Using the traditional spelling rules as a basis, it
would  be  very  difficult  to  program  all  the  required  knowledge  into  an 
expert  system  and  it  is  very  unlikely  that  these  rules  would  provide 
enough  constraints  on  allowable  letter  sequences  for  the  speller  to  be 
effective  as  the  primary  element  of  the  postprocessor. 

The  spelling  of  a  word  is  strongly  linked  to  sounds  used  in 
speaking  that  word.  However,  it  is  not  possible  to  construct  a  mapping 
between  each  of  the  individual  letters  and  one  or  more  specific  sounds. 
This  is  a  source  of  difficulty  when  using  the  phonic  approach  to 
teaching  people  to  read  (21:  156,  169,  183).  Spelling  is  more  closely 
patterned  by  a  phonemic  representation,  where  groupings  of  contrasting 
sounds  can  be  represented  by  sequences  of  letters  (21:  156,  169). 

Nevertheless,  there  is  not  a  one  for  one  mapping  between  separate 
phonemes  and  individual  letters  (21:  183).  The  mapping  between  phonemes 
and  letter  groups  has  a  many  to  many  correspondence.  The  indeterminism 
of  these  mappings  makes  it  difficult  to  use  the  sound  constraint 
techniques  developed  in  speech  recognition  research  as  a  basis  for  the 
spelling  constraints  of  written  words. 

The  rules  used  to  develop  the  spelling  expert  for  this 
postprocessor  are  based  on  several  simple  observations  about  spelling 
patterns:

1.  Letter  sequences  can  be  separated  into  two  major  classes, 
vowels  and  consonants. 

2. Each letter is a member of only one of these classes except for
the letter y. (Later, special provisions in the spelling
algorithm allow y to be classified as always a vowel.)

3. Regardless of how many phonemes are represented by a sequence
of  consonants  or  a  sequence  of  vowels,  words  are  composed  of 
alternating  groups  of  consonants  and  vowels. 

4.  A  word  can  start  with  either  a  vowel  sequence  or  a  consonant 
sequence.

5.  A  word  must  always  contain  a  vowel  sequence  (acronyms  and 
abbreviations  are  not  processed  by  the  speller). 

6.  The  sequences  of  vowels  and  the  sequences  of  consonants  that 
are  allowed  in  a  word  can  be  sub-classed  into  those  sequences 
which  can  occur  in  the  beginning  of  a  word,  those  which  can 
occur  in  the  middle  of  a  word,  and  those  which  can  occur  at  the 
end  of  a  word. 

A significant processing advantage of this representation is that
it is much easier to separate sequences of consonants from sequences of
vowels than to separate syllables based on pronunciation rules. One of
the obstacles to this approach is the letter y. This problem is solved
by always classifying y as a vowel. The result of this classification
is that the vowel sequences which include a consonant y must include
the vowels before and after that y in one continuous sequence. The
lists of vowel sequences are not very long, so the inclusion of these
additional multi-syllable sequences does not have a significant impact
on search speed. However, the enumeration of the middle consonant
sequences  is  not  as  easy  to  accomplish. 
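
The separation of a word into alternating vowel and consonant groups can be sketched in a few lines. The fragment below is only an illustration of the idea in Python; the function name is hypothetical and the thesis program itself is not reproduced here.

```python
from itertools import groupby

# y is always classified as a vowel, as described in the text.
VOWELS = set("aeiouy")

def letter_groups(word):
    """Split a word into alternating ('V' or 'C', letters) groups."""
    return [
        ("V" if is_vowel else "C", "".join(letters))
        for is_vowel, letters in groupby(word.lower(), key=lambda ch: ch in VOWELS)
    ]
```

For example, letter_groups("beyond") gives the groups b, eyo, nd. The consonant y is absorbed into one continuous vowel sequence together with the vowels on either side of it, which is the side effect noted above.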

If completely enumerated, the list of middle consonant sequences
would be quite long (approximately 2,400 entries). Instead, the list
of middle sequences could be derived from a short list (only 36 entries)
of consonant sequences which may be used individually as a complete
sequence or combined with the beginning consonant sequences (67 entries)
to form a middle sequence. A separate module is used to find the
appropriate split-up of the middle consonant sequence in a word so that
two smaller consonant sequences can be matched, one each, against the
middle and beginning consonant lists.

The  processing  flow  of  letter  sequence  checking  is  summarized  in 
Figure  2.  A  word  hypothesis  is  initially  checked  to  verify  the  presence 
of  a  vowel  sequence.  If  the  word  is  completely  composed  of  vowels,  a 
special  all-vowel  sequence  list  is  checked  for  a  valid  combination. 
Only those vowel sequences which form known words are included in that
list. If the word hypothesis is not an all vowel sequence, the normal
processing of alternating vowel and consonant sequences continues. The
first letter sequence is matched against either the beginning vowel list
or the beginning consonant list, whichever list is appropriate. If the
first sequence was a consonant type, the next sequence is a vowel group
and is matched against the middle vowel sequence list. If the first
sequence was a vowel group, then an initial attempt is made to match the
following consonant group against the list of middle consonant
sequences. If that attempt fails, the consonant sequence is split into
two sequences, with the first sequence having a maximum length of three.
The second sequence may initially be empty. Then, in successive trials
until two matches are found, a letter is removed from the first sequence
and pushed into the second sequence and new attempts are made to match
these sequences against the beginning and middle consonant lists.
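
The splitting of a middle consonant sequence can be sketched as follows. This is a minimal illustration, assuming that the first part is matched against the short middle list and the second part against the beginning list. The two sets are tiny hypothetical stand-ins, not the 36 and 67 entry lists of the thesis, and the function name is invented for the example.

```python
# Hypothetical, heavily abbreviated lists used only for illustration.
MIDDLE_CONSONANTS = {"n", "r", "s", "l", "m", "ck"}
BEGINNING_CONSONANTS = {"b", "t", "r", "tr", "pl", "str"}

def split_middle(sequence):
    """Return (first, second) if the consonant sequence is acceptable,
    otherwise None."""
    # Initial attempt: the whole sequence is a middle sequence by itself.
    if sequence in MIDDLE_CONSONANTS:
        return sequence, ""
    # Otherwise start with at most three letters in the first part and
    # move one letter at a time into the second part.
    for length in range(min(3, len(sequence)), 0, -1):
        first, second = sequence[:length], sequence[length:]
        if second and first in MIDDLE_CONSONANTS and second in BEGINNING_CONSONANTS:
            return first, second
    return None
```

With these sample lists, the sequence "nstr" (as in "instruct") fails as "nst" + "r" and as "ns" + "tr", and is accepted as "n" + "str".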


The  matching  process  loop  for  middle  vowel  sequences  and  middle 
consonant  sequences  continues  until  the  final  vowel  or  consonant 
sequence  in  the  word  is  found.  If  the  final  sequence  in  the  word  is  a 
consonant  group,  it  is  matched  against  the  ending  consonant  sequence 
list.  However,  ending  vowel  groups  use  the  same  sequence  list  as  middle 
vowels, with the exception of the all-vowel words. Any sequence in a
hypothesized  word  which  fails  the  matching  routines  causes  the  word  to 
be  rejected  as  a  hypothesis. 

The  decision  to  reject  a  word  hypothesis  can  be  overruled  by  the 
spelling  expert  subsystem.  After  the  spelling  routine  is  completed,  a 
rejected  hypothesis  is  checked  against  a  list  of  known  exceptions.  For 
instance, the list of beginning vowel sequences does not include the
double a; therefore, "aardvark" is included in a list of exceptions.
Aardvark  would  be  rejected  by  the  spelling  expert  algorithm;  but,  the 
hypothesis  rejection  would  be  overruled  by  the  knowledge  source  using 
the  list  of  exceptions.  The  complete  list  of  exceptions  and  unusual 
words  can  be  found  in  the  postprocessor  program  listings  in  Appendix  A. 
Some  of  the  words  in  the  exceptions  list  may  not  seem  as  unusual  as 
aardvark,  but  they  contain  a  vowel  or  consonant  sequence  which  is  found 
in  only  a  few  known  words.  Their  inclusion  into  the  exceptions  list 
results  in  additional  constraints  for  spelling  approval  of  a  word 
hypothesis.  The  overall  performance  of  the  spelling  expert  subsystem  is 
improved  by  using  the  exceptions  list  without  compromising  the  bounds  of 
the  postprocessor  vocabulary. 


Supplementary  Knowledge  Sources. 

There  are  several  other  knowledge  sources  operating  within  the 
framework  of  the  spelling  expert  subsystem.  These  supplementary 
knowledge sources are included in the subsystem to either add more
constraints to the letter sequences allowed in word hypotheses, or to
demonstrate how specialized knowledge sources can be modularly included
to handle unique processing difficulties that occur at the word level.
Only two of the specialized knowledge sources are included to serve as
examples  of  how  extra  processing  rules  and  heuristics  can  enhance  the 
postprocessor  in  a  modular  fashion.  The  two  specialized  knowledge 
sources  are  used  to  process  words  with  apostrophes  and  words  with 
hyphens. 

The knowledge source which applies additional constraints to the
selection of allowable letter sequences is essential to increasing
the effectiveness of the spelling expert. No constraints which can
analyze letter sequences that include both consonants and vowels are
implemented in the principal spelling expert. Also, the letter sequence
constraints imposed by the principal spelling algorithm are very liberal
because they allow a large number of non-words to be accepted as
hypotheses. That design feature was intentionally included to free the
system vocabulary from any enumerable bounds. Those letter sequences,
thought of as non-words under present day English, could become the
newly coined words or variations on current words in the future versions
of our language. This supplementary knowledge source is a convenient
place to add and remove spelling rules, especially as English may become
amenable to more and more rules as it continues its trend towards a
more standardized lexicon (e.g. most new verbs have regular
conjugation).

As  a  proposed  spelling  rule  or  spelling  generalization  approaches 
complete  adherence  by  the  language,  that  rule  can  be  added  to  the 
supplementary  spelling  expert  and  the  small  number  of  exceptions  to  that 
rule can be included in a special lookup list. For instance, the
spelling rule, "Q must be followed by U," is included in this knowledge
source with its exceptions, such as qoph and qat, included in a list of
unusual words. In contrast, if a spelling rule was proposed and
included in the supplementary expert, but it was later discovered that
there are widespread exceptions to that rule, then the rule can be
easily  removed  without  disrupting  the  operation  of  the  entire  spell 
checker  algorithm.  A  few  of  the  other  rules  included  in  the 
supplementary  expert  are: 

1.  The  sequence  "pn"  must  be  followed  by  the  letters,  "ue". 

2.  The sequence "ght" must be preceded by the letters, "au", "ou",
or  "i". 
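
Rules of this kind, together with their lookup list of exceptions, can be expressed compactly. The sketch below is an assumption about how such a supplementary expert might look in Python, covering only the Q rule and the "ght" rule. The function name is hypothetical and only the two exceptions named in the text are listed.

```python
import re

# Exceptions named in the text; the full list of unusual words is not reproduced.
UNUSUAL_WORDS = {"qoph", "qat"}

def violates_supplementary_rules(word):
    """Return True if the word breaks one of the sketched rules."""
    w = word.lower()
    if w in UNUSUAL_WORDS:
        return False
    # Q must be followed by U.
    if re.search(r"q(?!u)", w):
        return True
    # "ght" must be preceded by "au", "ou", or "i".
    if re.search(r"(?<!au)(?<!ou)(?<!i)ght", w):
        return True
    return False
```

Because each rule is a separate test, a rule that proves to have widespread exceptions can be deleted without touching the others.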

One  other  type  of  spelling  rule  is  included  in  the  supplementary 
expert. These rules are needed when the design of the principal
spelling expert is not able to constrain an obviously incorrect
sequence. It would be difficult to modify the principal algorithm
directly. Therefore, this supplementary spelling expert is a convenient
means to correct an unintentional, but inherent problem in the principal
spelling algorithm. The specific problem which appeared was the
approval of word hypotheses which contained a triple l as a middle
consonant sequence.

By  using  these  supplementary  rules,  the  entire  spelling  system 
becomes  adaptive  to  our  living  and  growing  language.  The  rules 
implemented  are  not  expected  to  be  an  exhaustive  set  for  our  current 
version  of  English.  However,  the  inclusion  of  this  implementation 
design  provides  the  means  for  updating  and  improving  the  spelling  expert 
subsystem  as  it  is  needed. 

The  two  specialized  knowledge  sources,  for  hyphenated  words  and 
words with apostrophes, which are triggered within the principal
spelling  expert  could  have  been  placed  under  the  control  of  the 
WordExpert  control  module.  Although  it  was  convenient  to  test  the 
conditionals  for  these  two  supplementary  knowledge  sources  during  the 
data  manipulation  of  the  spelling  system,  the  inclusion  of  additional 
supplementary  modules  within  the  speller  is  strongly  discouraged. 
Embedding  the  control  of  additional  modules  into  the  spelling  module 
could  disrupt  the  cohesion  of  the  spell  checking  process,  especially 
when  these  supplementary  modules  are  added,  deleted,  or  modified. 

The  apostrophe  expert  functions  by  first  locating  the  apostrophe 
within  the  word  hypothesis.  It  then  matches  the  contracted  letters  to  a 
list  of  known  possibilities.  The  root  word  is  extracted  and  passed  back 
to  the  spelling  expert  for  further  evaluation.  The  hyphenated  word 
expert  functions  in  a  similar  fashion.  It  first  locates  the  hyphen 
within  the  word  hypothesis  in  order  to  separate  the  word  into  two 
smaller words. After that, the hyphen expert recursively calls the
spelling expert for each of the two parts of the hyphenated word. After
the second pass through the spelling expert, control is returned to the
WordExpert module. If the first part of the hyphenated word is rejected
by the speller, the second part of the hyphenated word is not processed.
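
The control flow of the hyphen expert can be sketched as below. This is an illustration only: the function name is hypothetical, and the spelling expert is passed in as a parameter that returns True when a word is approved, which is an assumption made to keep the example self-contained.

```python
def hyphen_expert(hypothesis, spelling_expert):
    """Check both halves of a hyphenated word hypothesis.

    spelling_expert: callable returning True when a word is approved.
    """
    first, _, second = hypothesis.partition("-")
    # "and" short-circuits, so the second part is not processed when the
    # first part is rejected.
    return spelling_expert(first) and spelling_expert(second)
```

The short-circuit evaluation mirrors the behavior described above: a rejected first part ends the check without a second pass through the speller.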

Other supplementary knowledge sources could be modularly added to
this system. The additional mini-experts could apply heuristics to make
judgements on problems inherent to some texts. A simple example, which
is used by some commercial reading machines, is deciding between a lower
case l (el) and the numeral 1 (one). These characters are identical in
some fonts. The typical approach is to base the decision on the
surrounding characters. If it is surrounded by other numerals, decide
for the number 1; likewise, if it is surrounded by letters, decide for
the letter l.
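
A mini-expert for this decision might look like the following sketch. It is not taken from any commercial reading machine; the function name and the choice to leave mixed or missing context undecided are assumptions.

```python
def resolve_l_or_one(left, right):
    """Decide between the letter l and the numeral 1 from the neighbors.

    left, right: adjacent characters, or "" at a word boundary.
    Returns "1", "l", or None when the context does not settle the choice.
    """
    neighbors = [ch for ch in (left, right) if ch]
    if neighbors and all(ch.isdigit() for ch in neighbors):
        return "1"
    if neighbors and all(ch.isalpha() for ch in neighbors):
        return "l"
    return None
```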

Another  possibility  for  a  supplementary  knowledge  source  is  an 
acronym  expert.  The  acronym  expert  could  possibly  examine  the  first 
letters  in  the  words  preceding  the  acronym  hypothesis.  The  expert  could 
use  one  or  more  letters  in  each  of  the  immediately  preceding  words  to 
match  one  or  more  letters  in  the  acronym  hypothesis.  This  heuristic 
could be applied for the initial occurrence of an acronym. Discovered
acronyms would have to be stored for matching against subsequent
appearances of the acronym. The expert would be required to make
choices between established acronyms and newly hypothesized acronyms.
Implementation of this supplementary knowledge source could greatly
enhance this postprocessor by removing one of the primary constraints
placed on the input text. Unfortunately, non-trivial experts, such as
the acronym expert, involve a significant design and programming effort
and could not be included in the implementation studied under this
thesis. Implementations of such experts can easily be appended to this
postprocessor  design  and  are  left  as  a  suggestion  for  future  reading 
machine  research  projects. 


Other  Word  Level  Knowledge  Sources. 

There  are  two  other  knowledge  sources  which  influence  the  weighting 
of  competing  word  hypotheses.  One  knowledge  source  represents  the 
vocabulary  of  the  sentence  parser  used  at  the  higher  processing  level. 
The  other  knowledge  source  represents  a  short  term  memory  which  is  used 
to  increase  the  belief  weighting  for  words  that  are  used  frequently  in 
text  being  processed.  Both  of  these  knowledge  sources  are  implemented 
by  simple  dictionary  lookup  schemes. 

The  purpose  of  the  knowledge  source  that  represents  the  vocabulary 
of  the  sentence  parser  is  to  indirectly  increase  the  belief  weighting  of 
sentences  which  can  be  processed  by  the  parser.  By  performing  the 
weighting  at  this  level  of  processing,  repetitive  searches  for  the  same 
word  are  avoided  when  the  parsing  operation  takes  place.  Ideally,  the 
part  of  speech,  corresponding  to  the  hypothesized  word,  would  be  stored 
in  the  data  structure  containing  the  word  for  later  use  by  the  parser. 
However,  the  parser  operation  is  simulated  in  this  postprocessor 
prototype;  therefore,  the  parts  of  speech  associated  with  each  word  are 
not included in the word hypothesis data structure or the dictionary
file. The belief weightings of sentence hypotheses are increased in
proportion to the number of words in the sentence that are known to the
parser. The belief measure of each word hypothesis found in the parsing
dictionary is increased by a constant factor. Thus, the belief measure
for each sentence hypothesis, which is a combination of word belief
measures, is influenced in proportion to the number of words known by
the  parser. 
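
The weighting scheme can be illustrated with a small sketch. The text does not state the value of the constant factor or how word beliefs are combined into a sentence belief, so the factor of 1.5 and the use of a product below are assumptions, as are the function names.

```python
from math import prod

PARSER_BOOST = 1.5  # assumed value of the constant factor

def word_belief(base_belief, word, parser_dictionary):
    """Raise the belief of a word hypothesis found in the parser's vocabulary."""
    if word in parser_dictionary:
        return base_belief * PARSER_BOOST
    return base_belief

def sentence_belief(words, base_beliefs, parser_dictionary):
    """Combine word beliefs into a sentence belief (product assumed)."""
    return prod(
        word_belief(belief, word, parser_dictionary)
        for word, belief in zip(words, base_beliefs)
    )
```

Under these assumptions each word known to the parser multiplies the sentence belief by the same factor, so sentences with more known words are favored.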

In a manner similar to that of the parsing dictionary knowledge
source, the short term memory knowledge source influences the belief
measures of word hypotheses. Again, a word hypothesis which is found in
the lexicon of the short term memory has its belief measure increased.
However, the aim of this knowledge source is to influence the belief
measures of individual words, as opposed to the aim of influencing the
corresponding sentence belief measures. Some of the words used within
any body of text will have a natural tendency to recur. This is
partially the reason why the frequency of individual word occurrences in
the English language can be modeled by Zipf's Law (4; 12:52). The
reason for calling this dictionary type of knowledge source a short term
memory is that the size of the dictionary is fixed and it
contains only the most recent words used in the text.

If the short term memory is reinitialized for each separate text
being processed, the benefits of the short term memory would be
small at the beginning of the text being processed. This
effect can be offset by initializing the short term memory with a small
number of words. These words must be selected with minimum bias toward
any  particular  subject  domain  in  order  to  maintain  consistent 
performance  of  the  postprocessor  for  a  wide  variety  of  input  texts. 


There are several circumstances which cause some words to be
repeated in a text more often than others. If the text concerns a
specific subject domain, then words peculiar to that subject should
appear quite often. The working vocabulary of the author will
undoubtedly influence the words found in a specific text. These are
good reasons for having the short term memory but they are not good
foundations for picking the initialized memory. But there are other
reasons for finding the repetition of words within a text. Some words
are used repetitively in order to form transitions between thoughts
expressed from one sentence to the next. Other words are repeated
because they are necessary to format our thoughts within the syntax of
the language (i.e. conjunctions, prepositions, and articles). Also,
some words may be repeated because written thoughts are often expressed
with respect to some common references about time, space, people, or
object relationships (e.g. today, there, he, more). The words that are
used  repeatedly  because  of  the  manner  in  which  English  is  naturally 
written  should  be  the  basis  for  initializing  the  short  term  memory. 

If  a  very  large  sample  of  English  texts  were  analyzed  to  determine 
which  words  occur  most  often,  the  words  which  appear  at  the  top  of  that 
list should correspond to the words that appear in most texts because
of the nature of written English. Words which appear at the bottom of
the frequency listing are likely to be there because of the particular
subjects and authors that were sampled. In 1967, a study of word
occurrence frequencies was performed on a sample of English texts
containing a total of 1,014,232 running words. The collection of texts
was distributed among 500 samples of approximately 2,000 words. The
samples were chosen randomly from a wide variety of subject domains and
prose styles with the exclusions of poetry and drama. Analysis of this
collection of texts, known as the Standard Corpus of Present-Day Edited
American English (also called the Brown Corpus after the university at
which the study took place), enumerated exactly 50,406 different words.
The ten most frequent words and their occurrence frequencies were (21):


     Word      Occurrences

     the          69,971
     of           36,411
     and          28,852
     to           26,149
     a            23,237
     in           21,341
     that         10,595
     is           10,099
     was           9,816
     he            9,543

The short term memory is initialized with the 200 most frequent
words  from  the  Brown  Corpus.  As  sentences  are  approved  by  the 
postprocessor,  the  words  which  do  not  already  appear  in  the  short  term 
memory  are  added.  When  the  size  limit  of  400  words  is  reached,  only  the 
most  recently  used  words  are  stored  in  the  short  term  memory.  The  limit 
of  400  words  was  imposed  to  promote  efficiency  and  prevent  duplication 
of  the  parser  dictionary  effects. 
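
A fixed-size memory of this kind can be sketched with an ordered dictionary. The class below is an illustration, not the thesis code. The class and method names are hypothetical, and it assumes that reusing a word already in memory refreshes it, so that the oldest unused word is the one discarded at the limit.

```python
from collections import OrderedDict

class ShortTermMemory:
    """Fixed-size store of the most recently used words."""

    def __init__(self, seed_words, capacity=400):
        self.capacity = capacity
        self._words = OrderedDict()
        for word in seed_words:
            self.remember(word)

    def remember(self, word):
        word = word.lower()
        self._words[word] = True
        # Mark the word as most recently used (assumed refresh behavior).
        self._words.move_to_end(word)
        # Discard the least recently used words beyond the size limit.
        while len(self._words) > self.capacity:
            self._words.popitem(last=False)

    def __contains__(self, word):
        return word.lower() in self._words
```

In use, the memory would be seeded with the 200 most frequent words and remember() would be called for each word of every approved sentence.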


Syntactic  Expert. 


The  only  sentence  level  analysis  that  is  performed  by  the 
postprocessor  is  an  analysis  of  the  grammar  used  in  the  sentence.  If  a 
sentence hypothesis conforms to any of the arrangements of words allowed
by the rules of English syntax, the sentence hypothesis is permitted to
compete  for  the  final  selection  of  the  best  sentence.  Sentences  which 
do  not  conform  to  proper  syntax  are  eliminated  from  the  solution  search 
space.  For  sentences  with  words  that  are  not  in  the  parser's  lexicon, 
the  unknown  words  are  treated  as  wild  card  parts  of  speech  and  only  the 
known  words  determine  whether  the  sentence  structure  conforms.  The 
unknown  words  and  their  proposed  parts  of  speech  can  be  stored  for 
future use by the parser. Prior to syntactic processing, the group of
sentence hypotheses is ranked in order of their relative measures of
belief. The highest ranking sentence is processed by the parser first.
If that sentence parses, no further processing by the parser is
required. If no sentence in the group is approved by the syntax expert,
the highest ranking sentence prior to the parsing attempt is forwarded as
the  best  solution  for  the  sentence.  The  words  from  this  sentence  are 
then  used  to  update  the  short  term  memory  and  the  parsing  dictionary. 
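
The selection procedure can be summarized in a short sketch. The function name and the representation of a hypothesis as a (sentence, belief) pair are assumptions, and the parser is represented by any callable that returns True for a sentence it accepts, much as the thesis simulates it through interactive input.

```python
def select_sentence(hypotheses, parses):
    """Pick the best sentence from a non-empty list of (sentence, belief) pairs.

    parses: callable returning True when a sentence conforms to the syntax.
    """
    ranked = sorted(hypotheses, key=lambda pair: pair[1], reverse=True)
    for sentence, _ in ranked:
        # The first sentence that parses ends the search.
        if parses(sentence):
            return sentence
    # No sentence parsed: fall back to the highest ranking hypothesis.
    return ranked[0][0]
```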

The implementation of the syntactic parser is simulated through an
interactive input. The parsing vocabulary size was left undetermined.
For test purposes, a word was considered to be in the parsing vocabulary
if it was found in a standard dictionary of American English. No
particular dictionary is cited because the testing of the postprocessor
did  not  emphasize  the  vocabulary  of  the  parser.  Testing  of  the  parser 
focused  on  its  general  contribution  to  the  postprocessor  accuracy. 


IV. Testing and Results


Review of Goals.


The overall objectives of the English Sentence postprocessing
system are to improve the accuracy and the throughput of reading machine
technology. The postprocessor developed with this research is used to
study  how  an  expert  system  can  apply  diverse  knowledge  sources  to  reduce 
the  uncertainty  of  optical  character  recognition  and  to  remove  any  text 
input  restrictions  that  are  concerned  with  a  fixed  vocabulary  or  subject 
domain.  Some  of  the  specific  questions  that  are  examined  are: 

1.  Are  the  knowledge  sources  which  process  text  at  a  variety  of 
levels,  from  characters  on  up  to  complete  sentences  (or 
higher),  able  to  effectively  combine  evidence  to  produce  a 
more  accurate  output? 

2.  Is  the  spelling  expert  able  to  operate  effectively  without 
imposing  a  limited  vocabulary  or  presumed  subject  domain  bias 
on  the  text  input? 

3.  Are the interfaces to the sentence parser and to the OCR
front-end designed to easily accept a wide variety of black
box substitutions?

4.  Is the Short Term Memory useful in improving the accuracy of
the postprocessing system?


Method. 


Testing  of  the  postprocessor  was  done  by  observing  the  processing 
results  of  some  carefully  selected  input  and  some  random  input.  The 
selected inputs were chosen with knowledge of the program design and
expected program limitations to examine the general operation and the
boundary conditions of the various algorithms and heuristics used in the
postprocessor. The randomly selected input was used in an attempt to
uncover some unexpected limitations of the program. Approximately
2,000 individual words or pseudo-words were tested on the spelling
algorithm and the entire system was tested with sentences totalling over
500 words. The number of tests may seem small in comparison to the
domain,  but  by  directing  the  non-random  portion  of  the  tests  at 
suspected  weaknesses,  the  conditions  which  would  reduce  the 
effectiveness  of  the  postprocessor  were  identified. 

To perform the analysis of the postprocessing system, the program
was embedded with screen output statements to observe the changes that
take place for the size of the hypothesis search list and for the
combination and modification of hypothesis weightings. In many cases, a
reference to the knowledge source or the specific rule that was
responsible for eliminating a hypothesis from the solution search list
was displayed to the terminal. A majority of the testing was focused on
the performance of the spelling expert because of the novelty of its
algorithm and because that component was needed to perform a significant
reduction in the search space prior to the processing of other knowledge
sources in the postprocessor.

Observations. 

1.  Are  the  knowledge  sources  which  process  text  at  a  variety  of 
levels,  from  characters  on  up  to  complete  sentences  (or  higher),  able  to 
effectively  combine  evidence  to  produce  a  more  accurate  output? 

Each of the knowledge sources used by the postprocessor affects the
solution decision in one of two basic ways. A knowledge source can
limit or reduce the number of competing solution hypotheses, or a
knowledge source can modify the belief measure associated with each of
the hypotheses. The following list categorizes the effects of the
knowledge sources used in the postprocessor:

Thresholds  -  Restrict  the  number  of  characters  competing  for 
the  same  letter  position  within  a  word. 

Apostrophe Expert - Reduces the number of competing word
hypotheses if a rule violation occurs.

Hyphenation Expert - Reduces the number of competing word
hypotheses if a rule violation is found by the spelling
expert subsystem for any part of a hyphenated word.

Spelling Expert - Reduces the number of competing word
hypotheses if a rule violation occurs for sequences of
consonants or sequences of vowels within a word hypothesis.

Supplementary Spelling Rules - Reduces the number of
competing word hypotheses if a rule violation occurs.

Exceptions Data Base - Overrules a word hypothesis
reduction decision made by any of the above word level
knowledge sources.

Vocabulary  -  Can  modify  the  relative  weightings  of  competing 
word  hypotheses  based  on  a  dynamic  parsing  vocabulary. 

Short  Term  Memory  -  Can  modify  the  relative  weightings  of 
competing  word  hypotheses  based  on  contextual  recurrences. 

Syntactic  Parser  -  Reduces  the  number  of  competing  sentence 
hypotheses  if  a  violation  of  grammar  rules  occurs. 
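
The two kinds of effect in the list above (reducing the set of hypotheses, or modifying belief measures) can be sketched in a few lines of Python. This is only an illustration and not the thesis program, which is written in Pascal. The function names, the toy rule, the sample words, and the multiplicative boost are all assumptions made for the example.

```python
# Illustrative sketch only: names, the toy rule, and the weights are assumptions.

def prune(hypotheses, violates_rule, exceptions=frozenset()):
    """Reducing knowledge source: drop hypotheses that violate a rule,
    unless the exceptions data base overrules the rejection."""
    return {word: belief for word, belief in hypotheses.items()
            if word in exceptions or not violates_rule(word)}

def reweight(hypotheses, known_words, boost=2.0):
    """Weighting knowledge source: raise the belief of recognized words."""
    return {word: belief * boost if word in known_words else belief
            for word, belief in hypotheses.items()}

def ends_in_hn(word):
    # Toy rule standing in for a real spelling rule.
    return word.endswith("hn")

# Hypothetical competing word hypotheses with starting beliefs.
candidates = {"john": 0.5, "jobn": 0.3, "johhn": 0.2}
kept = prune(candidates, ends_in_hn, exceptions={"john"})
ranked = reweight(kept, known_words={"john"})
```

Here the toy rule would reject both words ending in "hn", but the exceptions set rescues "john", mirroring the way the exceptions data base overrules a word level rejection.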

Whether the knowledge sources narrowed the solution search space or
endorsed the plausibilities of particular hypotheses, the ability of
each knowledge source to improve the accuracy of the solution decision
could be associated with a general set of circumstances about the input
data. Overall, the individual and the combined efforts of the knowledge
sources in the expert system were useful in choosing the correct
hypothesis for a large number of test cases where the OCR front-end,
employing a forced decision methodology, would have chosen incorrectly.
Some examples of the performance improvement are shown in figures 3 and
4 (sample text from a mystery novel (23:108)). These particular test
samples showed increases in the individual character recognition rates
of 8.4 and 13.4 percentage points. The amount of improvement achieved
through postprocessing varied with the input baseline accuracy. The
circumstantial word and sentence ambiguities presented by the text input
also affected the recognition performance. Overall, it might be
misleading to affix specific numbers to represent the character
recognition improvement factors that were based only upon a collection
of tests which is very small relative to the possibilities that exist
for generic text input data. Therefore, a qualitative analysis was done
to describe the input data circumstances for which the expert system is
unsuccessful in resolving the recognition uncertainty.


Forced Recognition: 90.4% accuracy.

Ihere wes nc dcuht in bis mind. Eor some months he had
found himself entertaining wiid anb melodramatic suepioions.
He told me that he hed been aonvinced that his wife was
administering dnugs to him. Be had lived in India and the
practice of wives driving Lheir husbands insane by poisoniny
often cemes up in the native courts. He had soffered fairlg
often fnom hallucinations with confusion in his mind about
tine end qlace.

Using Postprocessor: 98.8% accuracy.

There was no doubt in his mind. For some months he had
found himself entertaining wild and melodramatic suepicions.
He told me that he had been convinced that his wife was
administering drugs to him. He had lived in India and the
practice of wives driving their husbands insane by poisoniny
often comes up in the native courts. He had soffered fairly
often from hallucinations with confusion in his mind about
time and place.


Forced Recognition: 79.9% accuracy.

Ihore raee nc dcuht In bis wind. Eor same months he had
found himself entertaining wiid anb welcdnematle sueyioions.
He boid ma that he hed been aonvinced that his wife wes
administering dnugs to bim. Be had lived in India and tbe
pnactioe of wives brivinygLheir husbanhs insane bg poiaoniny
often cemes up in the retive oourts. He had suffered fairlg
eften fnom ballocinabions with confusion In his mind about
time end qlace.

Using Postprocessor: 93.3% accuracy.

Thore mee no doubt in his wind. For same months he had
found himself entertaining wild and welodnematle suepicions.
He boid ma that he had been convinced that his wife was
administering drugs to him. He had lived in India and the
practice of wives briving their husbands insane by poisoniny
often comes up in the retive courts. He had suffered fairly
often from ballocinabions with confusion in his mind about
time and place.

In general, the combination of evidence was necessary to resolve
some uncertainties that could not be resolved by any individual
knowledge source within the postprocessor. This is demonstrated through
a summary of the processing steps for a simple test case. The actual
sentence being postprocessed is, "The hat was brown."

The second word was input with two letter positions having
uncertainty. The candidates for the h-position were b, h, and k. The
candidates for the a-position were a, e, and c. These letter candidates
resulted in nine word hypotheses: bat, hat, kat, bet, het, ket, bct,
hct, and kct. The spelling algorithm eliminated the last three
hypotheses immediately because they contained no vowels. The parsing
vocabulary recognized bat, hat, and bet. Therefore, those words were
given higher belief weightings than the other three nonsense words. The
short term memory recognized the word, hat, from its use in a previous
sentence. Therefore, the weighting of hat was increased so that it now
ranked higher than bat, bet, or any of the other hypotheses. The
sentence hypothesis containing hat was the first hypothesis examined by
the parser. Since it conformed to acceptable grammar, the sentence was
accepted as the solution. The combination of letter, word, and sentence
level information was successful in choosing the correct hypothesis.
However, there are circumstances which could produce the wrong solution
if they are present in the data.
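
The word level steps of the "hat" walk-through above can be traced with a short Python sketch. It is an illustration under stated assumptions, not the thesis implementation: the equal starting beliefs, the additive boosts of 1.0, and the contents of the two word sets are invented, and the sentence level parser step is left out.

```python
from itertools import product

VOWELS = set("aeiouy")

# Candidate letters per position for the second word of "The hat was brown."
positions = [["b", "h", "k"], ["a", "e", "c"], ["t"]]

# Nine word hypotheses, each starting with the same belief (an assumption).
hypotheses = {"".join(letters): 1.0 for letters in product(*positions)}

# Spelling step: discard hypotheses that contain no vowel (bct, hct, kct).
hypotheses = {w: b for w, b in hypotheses.items() if VOWELS & set(w)}

# Parsing vocabulary and short term memory: hypothetical contents and boosts.
parsing_vocabulary = {"bat", "hat", "bet"}
short_term_memory = {"hat"}
for word in hypotheses:
    if word in parsing_vocabulary:
        hypotheses[word] += 1.0
    if word in short_term_memory:
        hypotheses[word] += 1.0

# The highest ranked word is the first one handed to the parser.
best = max(hypotheses, key=hypotheses.get)
```

With these assumed weights, the vocabulary boost alone leaves bat, hat, and bet tied, and only the short term memory boost separates hat from the other two.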


One  data  circumstance  which  heavily  influences  the  eventual  outcome 
is  the  quality  of  the  input  sensor  data.  If  the  characters  being  read 
were very noisy, the word level knowledge sources that increase the
belief  weightings  for  familiar  words  may  not  be  able  to  compensate  for  a 
low  weighting  of  the  correct  word  hypothesis.  In  that  situation,  the 
unaided  OCR  would  have  made  an  incorrect  decision  too.  An  obvious 
extreme  case  of  this  problem  occurs  when  the  correct  letter  hypothesis 
is  not  within  the  top  five  choices  selected  by  the  thresholding  module. 
For  that  situation,  there  is  no  avoiding  an  incorrect  decision.  The  OCR 
must  be  sufficiently  accurate  that  the  correct  choice  is  initially 
weighted  within  the  hypothesis  threshold. 
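
The extreme case described above, where the correct letter falls outside the top five choices, can be shown with a small Python sketch. The function name and the distance values are hypothetical. The only details taken from the text are the limit of five choices and the use of distance from a prototype as the OCR weighting.

```python
def threshold_candidates(distances, max_choices=5):
    """Keep the closest letter candidates for one character position.

    distances maps a letter to its distance from the stored prototype;
    a smaller distance means a closer match.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return [letter for letter, _ in ranked[:max_choices]]

# Hypothetical distances for a very noisy "h"; the numbers are invented.
noisy_h = {"b": 0.8, "k": 1.1, "n": 1.3, "r": 1.6, "l": 1.9, "h": 2.4}
kept = threshold_candidates(noisy_h)

# If the true letter was cut by the threshold, no later knowledge source
# can bring it back, because it never enters a word hypothesis.
recoverable = "h" in kept
```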

There  are  circumstances  where  the  unaided  OCR  could  have  made  the 
correct  choice,  but  it  was  led  astray  by  the  postprocessor.  This 
situation  is  most  notable  for  the  short  term  memory  knowledge  source  and 
will  be  discussed  in  more  detail  under  question  four.  For  cases  where 
insufficient  evidence  is  available  to  improve  the  decision,  the 
hypothesis  selection  reverts  back  to  the  same  information  that  is 
available  to  the  unaided  OCR.  In  the  example  used  above,  if  the  word, 
hat,  was  not  recognized  by  the  short  term  memory,  the  postprocessor  and 
the  unaided  OCR  may  have  chosen  the  word,  bat.  A  human  reader  could  not 
do  any  better  unless  high  level  semantic  information,  concerning  the 
general  subject  of  the  text,  was  available. 

Other situations where the postprocessor can inject errors into the
recognition process occur when the expected input constraints are
violated. The postprocessor has difficulty with acronyms,
abbreviations, and some proper names. The erroneous hypothesis
rejections occur when these words do not conform to typical English
phonemic structure. Another specialized knowledge source could be added
to the postprocessor to handle the acronyms and abbreviations; a
possible heuristic for designing this knowledge source is suggested in
the next chapter. The difficulty with unusual proper names cannot be
similarly resolved for a postprocessor which is designed not to be
domain dependent. Another circumstance for extra errors occurs when the
input sentences are not grammatically complete. If the actual sentence
has incomplete grammar conformance, the parsing expert can select an
incorrect sentence hypothesis that has an unknown word used as a wild
card part of speech. Spelling mistakes in the input text are another
violation of the prescribed input constraints that can cause errors to
propagate from the postprocessor. Misspelled words can be accepted by
the postprocessor when the misspelled word conforms to natural phonemic
structure and it is not competing against a word hypothesis that is a
common English word. Some misspelled words can even be corrected by the
postprocessor if the incorrect letter was considered by the thresholds
as an uncertain character. But the chances of this happening along with
the other necessary conditions are very small. The type of misspelling
which causes the most problems is the transposition of letters.
Transpositions, especially within long consonant sequences, often
produce sequences which are not allowed in natural English words. If a
misspelled word is not rejected by the spelling expert, there is a good
chance that the correct hypothesis will be slighted by the other
knowledge sources in the postprocessor.


2. Is the spelling expert able to operate effectively without
imposing a limited vocabulary or presumed subject domain bias on the
text input?

The  spelling  expert  did  not  exhibit  any  unintentional  vocabulary 
restrictions during testing. Several types of test input data were used
to  validate  the  expected  operation  of  the  primary  spelling  algorithm. 
One  test  used  words  picked  randomly  from  the  dictionary  and  from 
assorted  texts.  Another  test  used  words  that  were  carefully  selected  to 
test  all  subcomponents  of  the  spelling  algorithm  with  emphasis  on  the 
splitting  of  middle  consonant  sequences.  The  last  type  of  test  data 
concentrated  on  the  ability  of  the  spelling  algorithm  to  resolve  the 
ambiguities  of  letters  which  typically  have  close  prototype  distances. 

The input words which represent a random selection of valid English
words were chosen from the dictionary and from a variety of texts such
as newspaper articles, fiction, and technical journals. The dictionary
words were chosen, several words in sequence, from a couple of random
pages for each letter in the alphabet. The reason for picking several
words in sequence from any page was to observe the differences in
processing words that were similar in spelling. Approximately 1,000
dictionary words were tested with no test word rejected by the spelling
expert. The test words, selected from assorted texts, were chosen in
groups of three to five consecutive sentences (duplicate words were
removed from the sample). Again, approximately 1,000 words were
tested. However, one test word was rejected by the spelling algorithm.
The proper name, McClellan, was rejected because of the beginning
consonant sequence, mccl. Further investigation showed that the
spelling algorithm rejected a significant number of proper names. The
rejections discovered were almost always for consonant sequences that
never seemed to appear in English words that were not proper names.
Many of the proper names which were rejected belonged to geographical
locations in foreign countries such as Gdansk, Kwangchow, or
Dneprodzerzhinsk. The names of people which were rejected were either
names of people from foreign countries (e.g. Khrushchev) or names with
strong ancestral ties to a non-English speaking country. The spelling
algorithm could not be altered to accept these letter combinations
without disrupting the model of typical English phonemic structure; most
of the words which were rejected are very difficult to pronounce for the
native English speaker. Also, the listing of all these proper names in
the exceptions data base would make processing of the data base too time
consuming. There are far too many name exceptions for people, places,
and things to be listed. The only proper name that was included in the
exceptions list was the given name, John. The hn consonant sequence at
the end of a word is unusual, but that name occurs very frequently in
English text.
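
The McClellan rejection described above can be reproduced with a minimal Python sketch of a beginning consonant sequence check. The set of permitted sequences here is a tiny invented sample, not the thesis data base, and treating "y" as a vowel is also an assumption.

```python
import re

# Tiny, hypothetical sample of permitted beginning consonant sequences.
BEGINNING_CONSONANTS = {"b", "c", "cl", "d", "j", "k", "m", "st", "str", "th"}
VOWELS = "aeiouy"

def beginning_sequence(word):
    """Return the run of consonants that starts the word (may be empty)."""
    return re.match(f"[^{VOWELS}]*", word.lower()).group(0)

def accepts_beginning(word):
    """Accept a word if it starts with a vowel or a permitted sequence."""
    sequence = beginning_sequence(word)
    return sequence == "" or sequence in BEGINNING_CONSONANTS

rejected = [name for name in ("McClellan", "Gdansk", "Khrushchev")
            if not accepts_beginning(name)]
```

A rule table of this kind accepts any word whose letter sequences look like English, which is why it needs no fixed vocabulary but also why names such as these fall outside it.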

The next group of tests on the speller concentrated on the
individual rules for each of the letter sequence knowledge sources. At
least one word for each of the letter sequences in those knowledge
sources was input to the spelling expert. This procedure validated that
each of the rules was applicable to valid English words. Another phase
of this group of testing was the input of nonsense words and of letter
strings that purposely violated the natural English phonemic structure.
One sequence of middle consonants which violated the natural phonemic
structure was identified. This structure, the triple-l (el), was later
corrected for by adding a rule to the supplementary spelling rule module
in the spelling expert subsystem. All the tested nonsense words which
did not conform to natural phonemic structure were rejected by the
spelling algorithm.

The third area of testing on the spelling expert used data from the
prototype distance matrix of a 2D-DFT of a simple character set. (See
appendices B and C.) Characters which were more likely to be confused
were suggested as letter hypotheses within a word to the postprocessing
system. The reactions of the spelling expert were then studied. For
instance, the lower case letters, c and o, have very close prototypes
for the test data. This information was used to set up competing word
hypotheses. The input data for a word such as "color" might have an "o"
substituted for the first letter to hypothesize the word, "oolor." If
the spelling algorithm did not use an exceptions list for letter
sequences which occur infrequently, the word, "oolor," would have been
accepted by the algorithm. However, the few words which begin with a
double-o are easy to enumerate from a dictionary (sequences in the
middle of words are not as easy to list). Therefore, the word, "oolor,"
is rejected by the spelling expert. In most cases, the spelling expert
did not need the spelling exceptions data base to reduce the hypothesis
search space.

Although the prototype distance matrix will vary with the
particular font being recognized and with the recognition method being
used, the matrix indicates which letters are close enough in resemblance
to form words that may become sources of errors in the postprocessor.
The closeness of the letters, c and o, could leave an unresolved
ambiguity between the words, cat and oat. Another pair of letters which
are reasonably close, h and b, can result in many ambiguous word pairs
such as hit and bit, hat and bat, or hay and bay. Considering another
close pair, g and y, a combination of two uncertain letter positions
could cause a group of four ambiguous hypotheses: bay, bag, hay, and
hag. This is not considered to be a shortcoming of the spelling
algorithm, but it does emphasize the need for additional knowledge
sources in the contextual postprocessor to help resolve these
ambiguities. Without the use of additional contextual knowledge, the
performance of the spelling algorithm is predominantly determined by the
signal quality of the input data for those circumstances where the close
prototypes can form word hypotheses that appear equally likely to any
spelling expert.
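
The ambiguous groups named above can be generated mechanically from the close letter pairs. The Python sketch below is only illustrative: the three pairs come from the text, but the helper names are hypothetical, and the small lexicon is used solely to pick out real English words from the generated strings (the thesis spelling expert itself does not rely on a fixed vocabulary).

```python
from itertools import product

# Letter pairs named in the text as having close prototypes.
CLOSE_PAIRS = [("c", "o"), ("h", "b"), ("g", "y")]

def alternatives(letter):
    """Return the letter together with any letters it is easily confused with."""
    options = {letter}
    for first, second in CLOSE_PAIRS:
        if letter in (first, second):
            options.update((first, second))
    return sorted(options)

def ambiguous_group(word, lexicon):
    """List the real words reachable by swapping confusable letters."""
    candidates = ("".join(letters)
                  for letters in product(*(alternatives(ch) for ch in word)))
    return sorted(w for w in candidates if w in lexicon)

lexicon = {"bay", "bag", "hay", "hag", "cat", "oat"}  # hypothetical sample
group = ambiguous_group("hay", lexicon)
```

Every member of such a group passes the spelling rules, so only the signal quality or a higher level knowledge source can choose among them.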

An interesting situation occurs when an error exists for a
character which is assumed by the thresholds as recognized with
acceptable certainty. This error would be processed without notice by
an unaided OCR using either a forced decision methodology or a
thresholding scheme for flagging recognition rejections. However, there
are numerous close letter prototypes that could draw attention if
interchanged for the true letter. For example, any of the close
prototypes which involve one consonant and one vowel could cause all
words in a hypothesis group to be rejected by the spelling expert (i.e.
if "c" was recognized instead of "o" when the vowel was the only vowel
in the word). The postprocessor has some ability to recognize errors
that occur within the prescribed acceptance threshold for character
recognition. On the other hand, those errors within the recognition
threshold which are not recognized could propagate multiple letter
errors within the word.


3. Are the interfaces to the sentence parser and to the OCR
front-end designed to easily accept a wide variety of black box
substitutions?


The interface requirements remained consistent with the original
design specification. The OCR front-end must supply a weighted list of
character hypotheses for each character position within a word. The
only difficulty observed for changing the OCR front-end of the
postprocessor concerned the setting of thresholds and the setting of the
weighting factors used by the parsing vocabulary and short term memory
knowledge sources. The thresholds can be computed from historical data
about the OCR error rates. However, the weighting factors used by
knowledge sources for changing the ranking of competing hypotheses are
sensitive to the accuracy and the thresholds of the OCR front-end. No
formal relationship could be derived for the dependence of the weighting
factors on the front-end accuracy. The general relationship which
seemed appropriate was an inverse relationship: use smaller weighting
factors for larger threshold values (more accurate front-ends). In the
limit of front-end accuracy, the inverse relationship suggests that a
very accurate OCR should not need the influence of contextual
information to increase its accuracy, while a very inaccurate OCR
requires the influence of a lot of contextual information to improve
accuracy.


4. Is the Short Term Memory useful in improving the accuracy of the
postprocessing system?

In many test cases, the additional hypothesis weighting attributed
to previous use of a word was a significant factor in the postprocessor
choosing the correct hypothesis. In the example sentence used in
question one, the final decision came down to just a couple of highly
ranked sentences. Given an equally weighted pair of sentence
hypotheses such as "The hat was brown." and "The bat was brown.", the
correct decision requires a high level semantic understanding of the
text being processed. Without using any of the traditional approaches
to story understanding (i.e. building an internal frame representation
of the key concepts and relationships expressed in the text) the short
term memory develops a dynamic, domain specific knowledge of the text.
Its knowledge base is not as powerful as the domain specific semantic
processors because it relies upon exact matching of key words as opposed
to matching classes of words that could be inferred through a semantic
network. For example, an advanced semantic processor could have chosen
hat over bat by having an expectancy for an item of clothing and
matching hat to the class of clothing items. The state of the art for
semantic processing is not sufficient for operation in an unrestricted
subject domain. The short term memory implemented in this postprocessor
provides a few of the semantic processing benefits without the domain
restrictions or processing overhead of traditional approaches to
semantic processing.


In some situations, the short term memory could incorrectly
influence the postprocessor by increasing the weighting of a recurring
word when that word was not related to the sentence currently being
processed. These errors are kept to a small number by limiting the
dynamic portion of the short term memory to 200 words. Ideally, the 200
words would be distributed between words that are preferential in the
author's vocabulary and words which are related to the specific subject
of the last few paragraphs of the text being processed. Texts which
have frequent changes in subject matter and authors who introduce
frequent changes in their writing style could produce the circumstances
that allow the short term memory to incorrectly influence the
postprocessor decisions. As the short term memory size is increased,
the likelihood of errors is increased because the domain specialization
developed by the short term memory may no longer be appropriate for the
text currently being processed.
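
A bounded memory of this kind can be sketched in Python as follows. The 200-word limit is from the text. The class name, the oldest-first forgetting policy, and the additive bonus are assumptions made for illustration, since the replacement policy is not described here.

```python
from collections import OrderedDict

class ShortTermMemory:
    """Remember the most recently used words, up to a fixed capacity."""

    def __init__(self, capacity=200):
        self.capacity = capacity
        self._words = OrderedDict()

    def remember(self, word):
        # Re-inserting a word moves it to the newest slot.
        self._words.pop(word, None)
        self._words[word] = True
        if len(self._words) > self.capacity:
            self._words.popitem(last=False)  # forget the oldest word

    def boost(self, hypotheses, bonus=1.0):
        """Raise the belief of hypotheses that match a remembered word."""
        return {word: belief + bonus if word in self._words else belief
                for word, belief in hypotheses.items()}

# A capacity of 2 is used here only to make the forgetting visible.
memory = ShortTermMemory(capacity=2)
for word in ("hat", "coat", "shoe"):
    memory.remember(word)
boosted = memory.boost({"hat": 1.0, "shoe": 1.0})
```

Because "hat" has been pushed out by newer words, it no longer receives a bonus. A small capacity limits stale influence in this way, at the cost of forgetting words that may still be relevant.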


V. Summary, Conclusions and Recommendations


Summary. 


The hierarchical postprocessor is a suitable arrangement for
improving the recognition error performance of reading machines
operating in a forced recognition environment. The hierarchical
architecture promotes a modular design, thus permitting easy
modification of the postprocessor knowledge sources and the ability to
interface OCRs of a wide variety of identification methodologies. Each
knowledge source used in the expert system was helpful in resolving text
ambiguities through independent and cooperative processing of the
solution hypotheses.

The most successful knowledge source of this contextual
postprocessor is the unlimited vocabulary spelling algorithm. The
algorithm employed rules about the sequences of consecutive vowels or
consecutive consonants that conform to the natural phonemic structure of
English words. Rejection of word hypotheses that do not conform to the
rules of this algorithm resulted in a significant decrease in the
solution search space that would be processed by other knowledge
sources in the postprocessor. The spelling algorithm is considered to
be vocabulary unlimited because it is liberal enough to accept some
words that are not in the American English lexicon, as long as they have
the appearance and sound of typical English syllables. Therefore, the
spelling algorithm is even capable of accepting words which may be
coined in the future. The algorithm does accept nonsense word
hypotheses; however, since those particular bad hypotheses are few in
number they can be sorted out by the other knowledge sources in the
expert system. The spelling algorithm rejects a majority of the
incorrect hypotheses, which permits more detailed processing to be
performed on the remaining hypotheses.

Some of the other knowledge sources implemented in the expert
system were concerned with assisting the spelling algorithm and with
processing unusual text situations such as word contractions and
hyphenated words. These knowledge sources were not intended to be
inclusive of all the knowledge sources necessary to process generic
text, but were intended to demonstrate the flexibility of this
postprocessing system for modularly adding spelling rules, exceptions to
rules, and other specialized knowledge sources that process very
specific circumstances that occur in English text. Also, the format for
a syntactic parser compatible with unlimited vocabulary operation within
the postprocessor was defined.


Conclusions. 


The accuracy of English sentence reading machines can be improved
through a postprocessing expert system employing diverse knowledge
sources. Arranging multiple knowledge sources in a hierarchical
processing configuration is effective in reducing a significant amount
of the uncertainty inherent to the character identification process in
OCRs. The use of many diverse knowledge sources is also effective in
removing some of the constraints placed upon input texts.

The postprocessor developed by this research has demonstrated that
a domain independent contextual postprocessor is producible and that it
can effectively increase the accuracy and throughput of text reading
machines. The key element to operation without domain dependent
knowledge that would constrain the subjects of the input text is the
unlimited vocabulary spelling algorithm. Also important to domain
independent postprocessing is the use of wild card parts of speech for
words which are not in the syntactic parser's vocabulary.

Recommendations.

A suitable continuation of this research effort could be the
development of the syntactic parser needed for the sentence level
processing. Established designs for English sentence parsers could be
reconfigured to permit processing using wild card parts of speech. The
parsing mechanism which handles the words not in the parsing vocabulary
could also employ a truth maintenance scheme to increase the parser's
lexicon after selection of a sentence hypothesis. The truth maintenance
process would have to be constrained to limit the processing time spent
on backtracking and second guessing conclusions on older sentence
hypotheses, especially in a long document.

Another area suitable for future research involves the spelling
algorithm developed for this postprocessor. The spelling algorithm may
be useful in resolving letter segmentation problems. Its unique
properties of operating on an unbounded vocabulary and quick decision
processing may be very appropriate for reducing the identification
ambiguities caused by letters merged together on the printed page or
merged during the optical digitization process. The algorithm itself
may not require any modifications. Instead, a group of competing input
word hypotheses could have a variable length. For each word hypothesis,
any letter position could be expanded to two or possibly three letter
positions, depending upon the particular set of nonsegmented letters.
The weighting process would also require adjustment to account for
multiple letter sets competing against single letters in one group of
word hypotheses.
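
The idea of variable length hypotheses can be illustrated with a short Python sketch. It is speculative, in keeping with the recommendation itself: the table of merged shapes is invented for the example, the function name is hypothetical, and the weighting adjustment mentioned above is not modeled.

```python
from itertools import product

# Hypothetical table: a recognized shape and the letter groups it might be.
MERGED_SHAPES = {"m": ["m", "rn"], "d": ["d", "cl"]}

def segmentation_hypotheses(recognized):
    """Expand each letter position into its possible one- or two-letter readings."""
    options = [MERGED_SHAPES.get(ch, [ch]) for ch in recognized]
    return ["".join(parts) for parts in product(*options)]

words = segmentation_hypotheses("bum")
```

The resulting hypotheses differ in length, and each could then be passed unchanged to the spelling rules, which is why the algorithm itself may need no modification.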

This  contextual  postprocessing  expert  system  shows  great  promise 
for  the  basis  of  a  generic  text  reading  machine.  The  general  expansion 
of  this  system  through  the  development  of  additional  domain  independent 
knowledge  sources  that  are  focused  on  removing  the  constraints  presently 
placed  upon  input  text  data  would  be  a  productive  step  toward  the 
development  of  an  accurate,  automated  reading  machine. 


APPENDIX  A 


Postprocessor  Program  Listings  and  Data  Base 

This appendix contains the listing for the "reader.pas"
postprocessing program and all the source files that are used to
initialize the knowledge sources in the program. Those data base source
files contain the beginning vowel sequences, beginning consonant
sequences, middle consonant sequences, middle/end vowel sequences,
all-vowel words, the 200 most frequent words, and the list of
exceptions to the rule-based knowledge sources. Note that in the last
listing, any "s" at the end of a word was removed to comply with the
program requirements for finding plurals of words.
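
As a rough illustration of how those sequence files are used, the
following Python sketch (not the thesis's Pascal code) splits a word into
the alternating vowel and consonant groups that the spelling algorithm
looks up. Counting "y" as a vowel and dropping one trailing "s" follow
the program listing; the function name `letter_groups` is hypothetical,
and the lookup of each group in its data file is omitted.

```python
from itertools import groupby

VOWELS = set("aeiouy")  # the listing treats "y" as a vowel

def letter_groups(word):
    """Split a word into alternating vowel and consonant groups."""
    word = word.lower()
    if len(word) > 1 and word.endswith("s"):
        word = word[:-1]  # drop a trailing "s", as done for plurals
    return ["".join(run)
            for _, run in groupby(word, key=lambda ch: ch in VOWELS)]

print(letter_groups("Strengths"))
```

In the listing, the first group is checked against the beginning
sequences, the last consonant group against the ending sequences, and
the groups in between against the middle sequences.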

(******************************************************************************)
(*
                         Post-Detection Processor
                                    for
                       Expert System Reading Machine

   By:  Capt David V. Paciorkowski                          December 1985

   Description: This program accepts an input file of characters and their
   weightings from an OCR. Each line of the input file, Imagefile, contains
   alternating characters and real numbers (representing distance from the
   prototype) for each character position in the text being processed. A
   line containing "9.9" flags the end of a word. A line containing a number
   greater than 10.0 flags the end of the text being processed. The only
   punctuation marks presently handled by this program are periods, question
   marks, exclamation marks, hyphens, and apostrophes. The syntactic parser
   is simulated with prompts to the operator. *)


program Reader (input, Imagefile, Vocabulary, Strange, Words200, BegSyl, MidSyl,
                EndSyl, IsoVowels, Vowels, BegVowels, TempText, output);

type WordString = packed array [1..20] of char;
     Sylable = packed array [1..4] of char;
     Image = record
               Character: char;
               Measure  : real      (* prototype distance / belief measure *)
             end;
     ImageList = array [1..5] of Image;        (* up to 5 choices per letter *)
     LetterList = array [1..20] of ImageList;  (* array of probable letters *)
     WordGuess = record
                   Word: WordString;           (* possible word *)
                   Belief: real                (* measure of belief for word *)
                 end;
     WordList = array [1..27] of WordGuess;    (* group of possible words *)
     WordGroup = array [1..50] of WordList;    (* up to 50 words/sentence *)
     Sentence = record                 (* assembled sentence & belief rating *)
                  SentGuess: array [1..50] of WordString;
                  Belief: real
                end;
     SentList = array [1..20] of Sentence;     (* array of probable sentences *)
     SylabPtr = ^SylList;                      (* pointer in linear list *)
     SylList = record                  (* for a linear list of sylables *)
                 Syl : Sylable;
                 Next: SylabPtr
               end;
     DictPtr = ^DictWord;                      (* pointer in linear list *)
     DictWord = record                 (* for a linear list of vocabulary words *)
                  Word: WordString;
                  Next: DictPtr
                end;

var Threshold : array[32..126] of real;
    Strange,                      (* file of unusually spelled words *)
    Words200,                     (* file of 200 most frequent words *)
    BegSyl,                       (* file of beginning consonants *)
    MidSyl,                       (* file of middle consonant letters *)
    EndSyl,                       (* file of ending consonant letters *)
    Vowels,                       (* file of middle/end vowel letters *)
    IsoVowels,                    (* file of all vowel words *)
    BegVowels,                    (* file of vowels beginning words *)
    TempText,                     (* temp storage of read text *)
    Imagefile : text;             (* contains input data from OCR *)
    StrangeWordList,              (* points to unusual spellings *)
    ShortTermList,                (* points to short term memory *)
    Beg200,                       (* points to middle of s.t. memory *)
    IsoVowStart: DictPtr;         (* points to all vowel word list *)
    BegSylStart,                  (* points to start of a letter list *)
    MidSylStart,                  (* points to start of a letter list *)
    EndSylStart,                  (* points to start of a letter list *)
    BegVowStart,                  (* points to start of a letter list *)
    VowelStart : SylabPtr;        (* points to start of a letter list *)
    NewSentence: Sentence;        (* next sentence being read *)
    Done : boolean;               (* indicates end of imagefile *)


procedure Initialize;

var NewWord, Temp: DictPtr;
    NewSyl, Temp2: SylabPtr;
    I: integer;

begin
  reset (Imagefile);
  Done := false;
  rewrite (TempText);

(*---------- Thresholds ----------*)

(* Set thresholds for recognizing character within the
   requirements for certainty. Set to just less than half the
   closest distance to next letter for tests in this thesis. *)

  for I := 32 to 126 do
    Threshold[I] := 0.2;              (* default thresholds are .2 *)

  (* numerals: *)
  Threshold[48] := 0.08;    Threshold[53] := 0.23;
  Threshold[49] := 0.00;    Threshold[54] := 0.17;
  Threshold[50] := 0.29;    Threshold[55] := 0.31;
  Threshold[51] := 0.20;    Threshold[56] := 0.12;
  Threshold[52] := 0.19;    Threshold[57] := 0.21;

  (* upper case letters: *)
  Threshold[65] := 0.25;    Threshold[78] := 0.21;
  Threshold[66] := 0.12;    Threshold[79] := 0.12;
  Threshold[67] := 0.17;    Threshold[80] := 0.18;
  Threshold[68] := 0.08;    Threshold[81] := 0.34;
  Threshold[69] := 0.19;    Threshold[82] := 0.18;
  Threshold[70] := 0.19;    Threshold[83] := 0.22;
  Threshold[71] := 0.18;    Threshold[84] := 0.21;
  Threshold[72] := 0.19;    Threshold[85] := 0.18;
  Threshold[73] := 0.06;    Threshold[86] := 0.24;
  Threshold[74] := 0.23;    Threshold[87] := 0.39;
  Threshold[75] := 0.28;    Threshold[88] := 0.23;
  Threshold[76] := 0.25;    Threshold[89] := 0.24;
  Threshold[77] := 0.36;    Threshold[90] := 0.26;

  (* lower case letters: *)
  Threshold[97]  := 0.23;   Threshold[110] := 0.20;
  Threshold[98]  := 0.17;   Threshold[111] := 0.11;
  Threshold[99]  := 0.11;   Threshold[112] := 0.19;
  Threshold[100] := 0.22;   Threshold[113] := 0.17;
  Threshold[101] := 0.21;   Threshold[114] := 0.21;
  Threshold[102] := 0.26;   Threshold[115] := 0.27;
  Threshold[103] := 0.16;   Threshold[116] := 0.25;
  Threshold[104] := 0.17;   Threshold[117] := 0.20;
  Threshold[105] := 0.31;   Threshold[118] := 0.30;
  Threshold[106] := 0.33;   Threshold[119] := 0.39;
  Threshold[107] := 0.23;   Threshold[120] := 0.34;
  Threshold[108] := 0.00;   Threshold[121] := 0.16;
  Threshold[109] := 0.36;   Threshold[122] := 0.33;

(*---------- Short Term Vocabulary ----------*)

  reset (Words200);
  new (ShortTermList);
  NewWord := ShortTermList;
  while not eof (Words200) do
    begin
      I := 1;
      NewWord^.Word := '                    ';
      while not eoln (Words200) and (I < 21) do
        begin
          read (Words200, NewWord^.Word[I]);
          I := I + 1
        end;
      readln (Words200);
      new (Temp);
      NewWord^.Next := Temp;
      NewWord := NewWord^.Next
    end;
  Beg200 := NewWord;
  Beg200^.Next := nil;                (* mark end of linked list *)

(*---------- Unusual Spellings ----------*)

  reset (Strange);
  new (StrangeWordList);
  NewWord := StrangeWordList;
  while not eof (Strange) do
    begin
      I := 1;
      NewWord^.Word := '                    ';
      while not eoln (Strange) and (I < 21) do
        begin
          read (Strange, NewWord^.Word[I]);
          I := I + 1
        end;
      readln (Strange);
      new (Temp);
      NewWord^.Next := Temp;
      NewWord := NewWord^.Next
    end;
  NewWord^.Next := nil;               (* mark end of linked list *)

(*---------- Beginning Sylables ----------*)

  reset (BegSyl);
  new (BegSylStart);
  NewSyl := BegSylStart;
  while not eof (BegSyl) do
    begin
      I := 1;
      NewSyl^.Syl := '    ';
      while not eoln (BegSyl) and (I < 5) do
        begin
          read (BegSyl, NewSyl^.Syl[I]);
          I := I + 1
        end;
      readln (BegSyl);
      new (Temp2);
      NewSyl^.Next := Temp2;
      NewSyl := NewSyl^.Next
    end;
  NewSyl^.Next := nil;                (* mark end of linked list *)

(*---------- End Sylables ----------*)

  reset (EndSyl);
  new (EndSylStart);
  NewSyl := EndSylStart;
  while not eof (EndSyl) do
    begin
      I := 1;
      NewSyl^.Syl := '    ';
      while not eoln (EndSyl) and (I < 5) do
        begin
          read (EndSyl, NewSyl^.Syl[I]);
          I := I + 1
        end;
      readln (EndSyl);
      new (Temp2);
      NewSyl^.Next := Temp2;
      NewSyl := NewSyl^.Next
    end;
  NewSyl^.Next := nil;                (* mark end of linked list *)

(*---------- Middle Sylables ----------*)

  reset (MidSyl);
  new (MidSylStart);
  NewSyl := MidSylStart;
  while not eof (MidSyl) do
    begin
      I := 1;
      NewSyl^.Syl := '    ';
      while not eoln (MidSyl) and (I < 5) do
        begin
          read (MidSyl, NewSyl^.Syl[I]);
          I := I + 1
        end;
      readln (MidSyl);
      new (Temp2);
      NewSyl^.Next := Temp2;
      NewSyl := NewSyl^.Next
    end;
  NewSyl^.Next := nil;                (* mark end of linked list *)


(*---------- Vowel Groupings ----------*)

  reset (Vowels);
  new (VowelStart);
  NewSyl := VowelStart;
  while not eof (Vowels) do
    begin
      I := 1;
      NewSyl^.Syl := '    ';
      while not eoln (Vowels) and (I < 5) do
        begin
          read (Vowels, NewSyl^.Syl[I]);
          I := I + 1
        end;
      readln (Vowels);
      new (Temp2);
      NewSyl^.Next := Temp2;
      NewSyl := NewSyl^.Next
    end;
  NewSyl^.Next := nil;                (* mark end of linked list *)

(*---------- Vowel Words ----------*)

  reset (IsoVowels);
  new (IsoVowStart);
  NewWord := IsoVowStart;
  while not eof (IsoVowels) do
    begin
      I := 1;
      NewWord^.Word := '                    ';
      while not eoln (IsoVowels) and (I < 5) do
        begin
          read (IsoVowels, NewWord^.Word[I]);
          I := I + 1
        end;
      readln (IsoVowels);
      new (Temp);
      NewWord^.Next := Temp;
      NewWord := NewWord^.Next
    end;
  NewWord^.Next := nil;               (* mark end of linked list *)

(*---------- Beginning Vowels ----------*)

  reset (BegVowels);
  new (BegVowStart);
  NewSyl := BegVowStart;
  while not eof (BegVowels) do
    begin
      I := 1;
      NewSyl^.Syl := '    ';
      while not eoln (BegVowels) and (I < 5) do
        begin
          read (BegVowels, NewSyl^.Syl[I]);
          I := I + 1
        end;
      readln (BegVowels);
      new (Temp2);
      NewSyl^.Next := Temp2;
      NewSyl := NewSyl^.Next
    end;
  NewSyl^.Next := nil                 (* mark end of linked list *)

end;  (* End of INITIALIZE module *)


procedure WordSearch (Start: DictPtr; TestWord: WordString; var Found: boolean);

(* Utility to use the parsing dictionary and the list of word exceptions to
   the spelling algorithm. Linear search of linked list. *)

var Current: DictPtr;

begin
  Current := Start;
  Found := false;

  (*--- Search until match or end of linked list ---*)
  while (Current <> nil) and (not Found) do
    if Current^.Word = TestWord
      then Found := true
      else Current := Current^.Next

end;  (* End of WORDSEARCH module *)


procedure ExtraTest (Word: WordString; TestSyl: Sylable; var Valid: boolean);

(* Rule-based knowledge source for exceptions and additions to the main
   spelling algorithm. *)

var I, J: integer;
    Match: boolean;

begin
  writeln ('ExtraTest:', TestSyl);
  Valid := false;
  Match := false;
  I := 1;

  (*--- Find the Matching Sylable ---*)
  while not Match do
    begin
      (*-- Match the 1st letter of TestSyl --*)
      if Word[I] = TestSyl[1]
        then Match := true
        else I := I + 1;
      J := 2;

      (*-- Match remaining letters of TestSyl --*)
      while Match and (J < 5) and (TestSyl[J] <> ' ') do
        if TestSyl[J] = Word[I-1+J]
          then J := J + 1
          else
            begin
              I := I + 1;
              Match := false
            end
    end;

  (*--- Special Rules ---*)
  if (TestSyl = 'q   ') and (Word[I+1] = 'u')
    then Valid := true;

  if (TestSyl = 'pn  ') and (Word[I+2] = 'u') and (Word[I+3] = 'e')
    then Valid := true;

  if (TestSyl = 'ght ') and ((Word[I-1] = 'u') and (Word[I-2] in ['a','o'])
                         or (Word[I-1] = 'i') and (Word[I-2] in ['e','b',
                             'f','l','m','n','r','s','t']))
    then Valid := true;

  if (TestSyl = 'll  ') and (Word[I+2] <> 'l')
    then Valid := true;

  if (TestSyl = 'zz  ') and (Word[I+2] <> 'z')
    then Valid := true

end;  (* End of EXTRA TEST module *)

procedure Search (Start: SylabPtr; Word: WordString; TestSyl: Sylable;
                  var Found: boolean);

(* Linear search utility for spelling algorithm. Also used to trigger
   supplementary spelling exception rules. *)

var Current: SylabPtr;

begin
  writeln ('Now searching for ', TestSyl);
  Current := Start;
  Found := false;

  (*--- Search until match or end of linked list ---*)
  while (Current <> nil) and (not Found) do
    if Current^.Syl = TestSyl
      then Found := true
      else Current := Current^.Next;

  (*--- Also check extra spelling rules ---*)
  if Found and ((TestSyl = 'q   ') or
                (TestSyl = 'ght ') or
                (TestSyl = 'll  ') or
                (TestSyl = 'zz  ') or
                (TestSyl = 'pn  '))
    then ExtraTest (Word, TestSyl, Found)

end;  (* End of SEARCH module *)

(******************************************************************************)

procedure SplitSearch (Word: WordString; First, Second: Sylable;
                       var Found: boolean);

(* Utility to find appropriate split-up of middle consonant sequences; but
   does not correspond to pronounced syllable separations. *)

var Temp1, Temp2: Sylable;
    I, J, K, L: integer;

begin
  Found := false;
  Temp1 := First;
  Temp2 := Second;
  L := 3;                             (* counter for Temp1 letters *)
  for I := 4 downto 1 do              (* find start of Temp2 blanks *)
    if Temp2[I] = ' '
      then J := I;

  (* # of attempts = # of blanks *)
  for I := J to 3 do
    if not Found
      then
        begin
          for K := (I-1) downto 1 do  (* shift letters *)
            Temp2[K+1] := Temp2[K];
          Temp2[1] := Temp1[L];       (* add new letter to Temp2 *)
          Temp1[L] := ' ';
          L := L - 1;                 (* decrement counter *)
          if L <> 0                   (* check first consonant group *)
            then Search (MidSylStart, Word, Temp1, Found);
          if Found or (L = 0)         (* check second consonant group *)
            then Search (BegSylStart, Word, Temp2, Found)
        end

end;  (* End of SPLIT SEARCH module *)

procedure Apostrophe (var Word: WordString; I: integer; var Length: integer;
                      var Valid: boolean);

(* Rule-based knowledge source for checking valid endings of apostrophe words
   except for the word, o'clock, which is in the exceptions listing. The word
   stripped of the apostrophe and ending is returned to the spelling
   algorithm for further processing. *)

var J: integer;

begin
  Valid := false;

  (*--- For the following cases: "...s'" and "...'s" and "...'d" and
        "...'ll" and "...'ve" and "...'re" ---*)

  if ((Length = I) and (Word[I-1] = 's')) or
     ((Length = I + 1) and ((Word[I+1] = 's') or
                            (Word[I+1] = 'd'))) or
     ((Length = I + 2) and (((Word[I+1] = 'l') and (Word[I+2] = 'l')) or
                            ((Word[I+1] = 'v') and (Word[I+2] = 'e')) or
                            ((Word[I+1] = 'r') and (Word[I+2] = 'e'))))
    then
      begin
        Valid := true;
        for J := I to Length do
          Word[J] := ' ';
        Length := I - 1
      end;

  (*--- Take care of "...n't" case ---*)

  if ((Length = I + 1) and (Word[I-1] = 'n') and (Word[I+1] = 't'))
    then
      begin
        Valid := true;
        for J := (I-1) to Length do
          Word[J] := ' ';
        Length := I - 1
      end

end;  (* End of APOSTROPHE module *)


procedure SpellCheck (NewWord: WordString; var WordValid: boolean);

(* Main spelling algorithm - uses alternating sequences of vowels and
   consonants to accept word. *)

label 88, 99, 199;

type UpperCase = set of 'A' .. 'Z';
     LowerCase = set of 'a' .. 'z';

var Temp1, Temp2: Sylable;
    Length, I, J, K: integer;
    Found: boolean;
    Word, Word1, Word2: WordString;
    Capitals: UpperCase;
    Consonants, Vowels: LowerCase;

begin

  (*--- Initialize Letter Sets ---*)
  Capitals := ['A' .. 'Z'];
  Vowels := ['a', 'e', 'i', 'o', 'u', 'y'];
  Consonants := ['a' .. 'z'] - Vowels;

  (*--- Copy Word in Lower Case Letters ---*)
  I := 1;
  while (I < 21) and (NewWord[I] <> ' ') do
    begin
      if NewWord[I] in Capitals
        then Word[I] := chr(ord(NewWord[I]) + 32)
        else Word[I] := NewWord[I];
      I := I + 1
    end;
  Length := I - 1;                    (* remember word length *)
  writeln ('Length = ', Length);
  for I := (Length + 1) to 20 do      (* fill word with blanks *)
    Word[I] := ' ';

  (*--- Any single character word is Valid ---*)
  if Length = 1
    then WordValid := true;

  (*--- Process Apostrophe & Hyphenated Words ---*)
  for I := 1 to Length do             (* check word for apostrophes *)
    begin                             (* and for hyphenation *)
      if Word[I] = ''''
        then
          begin
            Apostrophe (NewWord, I, Length, WordValid);
            writeln ('Apostrophe check is ', WordValid);
            if not WordValid
              then goto 99
              else goto 88
          end;

      if Word[I] = '-'
        then
          begin                       (* Hyphen Word Processing *)
            for K := 1 to I-1 do
              Word1[K] := Word[K];
            for K := I+1 to Length do
              Word2[K] := Word[K];
            SpellCheck (Word1, WordValid);
            if WordValid
              then SpellCheck (Word2, WordValid);
            goto 199
          end
    end;


88:  (* exit for-loop after first apostrophe found *)

  (*--- Find a Vowel in Word ---*)
  WordValid := false;
  for I := 1 to Length do
    if Word[I] in Vowels
      then WordValid := true;
  if not WordValid
    then
      begin
        writeln ('No Vowel in word.');
        goto 99
      end
    else writeln ('Vowels were ok.');

  (*--- Check for "s" Pluralization ---*)
  if Word[Length] = 's'
    then
      begin
        Word[Length] := ' ';          (* remove the "s" *)
        Length := Length - 1          (* shorten the word *)
      end;

  (*--- Try to match an all Vowel Word ---*)
  WordSearch (IsoVowStart, Word, WordValid);
  if WordValid
    then goto 99
    else WordValid := true;           (* reset for further processing *)

  (*--- Process First Letter Group ---*)
  Temp1 := '    ';
  if (Word[1] in Vowels)
    then  (* collect first group of vowels & verify correctness *)
      begin
        I := 1;
        while Word[I] in Vowels do
          begin
            if I > 4                  (* no more than 4 consecutive vowels *)
              then
                begin
                  WordValid := false;
                  writeln ('More than 4 vowels starting word ');
                  goto 99
                end;
            Temp1[I] := Word[I];
            I := I + 1
          end;
        Search (BegVowStart, Word, Temp1, Found)
      end
    else  (* collect first group of consonants & verify correctness *)
      begin
        I := 1;
        while Word[I] in Consonants do
          begin
            if I > 4      (* no more than 4 consecutive starting consonants *)
              then
                begin
                  WordValid := false;
                  writeln ('More than 4 consonants starting word');
                  goto 99
                end;
            Temp1[I] := Word[I];
            I := I + 1
          end;
        Search (BegSylStart, Word, Temp1, Found)
      end;

  J := I - 1;                         (* remember letter position *)
  if not Found                        (* verify first letter group was valid *)
    then
      begin
        WordValid := false;
        writeln ('First letter group failed ', Temp1, '|');
        goto 99
      end;

  (*--- Process Remaining Sylables ---*)
  while J < Length do
    begin
      writeln ('J=', J:2, 'Length = ', Length:2);
      Temp1 := '    ';

      (*--- Get Next Letter Group ---*)
      if (Word[J+1] in Vowels)
        then
          begin
            I := J + 1;
            while Word[I] in Vowels do
              begin
                writeln ('vowel is:', Word[I]);
                if ((I - J) > 4)
                  then
                    begin
                      WordValid := false;
                      writeln ('vowel group >4 failed: ', Word);
                      goto 99
                    end;
                Temp1[I-J] := Word[I];
                I := I + 1
              end;
            Search (VowelStart, Word, Temp1, Found);
            J := I - 1;               (* remember letter position *)
            if not Found              (* verify letter group was valid *)
              then
                begin
                  WordValid := false;
                  writeln ('Vowel group not found');
                  goto 99
                end
          end
        else  (*--- get next group, if consonants ---*)
          begin
            Temp2 := '    ';
            I := J + 1;               (* get next letter position *)
            while (Word[I] in Consonants) and (I < 21) do
              begin
                if (I-J > 6) or ((I-J > 3) and (I-1 = Length))
                  then
                    begin
                      WordValid := false;
                      writeln ('Cons group greater than 6 letters');
                      goto 99
                    end;
                if (I - J < 4)  (* 1st attempt to split consonant sylables *)
                  then Temp1[I-J] := Word[I]
                  else Temp2[I-J-3] := Word[I];
                I := I + 1
              end;

            if (I-1 = Length)
              then
                begin
                  Search (EndSylStart, Word, Temp1, Found);
                  J := I - 1
                end
              else
                begin
                  Search (MidSylStart, Word, Temp1, Found);
                  if Found and (I-J > 4) and (I-1 < Length)
                    then  (*-- attempt split search if enough consonants --*)
                      begin
                        Search (BegSylStart, Word, Temp2, Found);
                        writeln ('Search Temp2:', Temp2)
                      end
                    else
                      if not Found and (I-1 < Length)
                        then
                          begin
                            writeln ('SplitSearch Temp1:', Temp1,
                                     ' and Temp2:', Temp2);
                            SplitSearch (Word, Temp1, Temp2, Found)
                          end;
                  J := I - 1          (* remember last letter position *)
                end;

procedure WordExpert (Count: integer; var Choices: WordList);

(* Lowest level control module. Used to interface word & letter level
   knowledge sources such as the spelling algorithm, the short term
   memory, and the parser's vocabulary. Additional rule-based knowledge
   sources can be added with calls from this module. *)

var I, J, NewCount, Position: integer;
    Found, WordValid: boolean;

begin

  (*--- Normalize Belief Measures ---*)
  Normalize (Count, Choices);

  (*--- Display current choices ---*)
  writeln ('The choices for this word are:');
  for I := 1 to Count do
    writeln (Choices[I].Word, ' rated at ', Choices[I].Belief);

  NewCount := Count;
  Position := 1;
  while Position <= NewCount do
    begin
      writeln (' Spell Checker Results');
      SpellCheck (Choices[Position].Word, WordValid);
      writeln (Choices[Position].Word:24, WordValid:8);
      readln;

      if not WordValid
        then  (*--- shift other choices forward to eliminate word ---*)
          begin
            NewCount := NewCount - 1;
            for J := Position to NewCount do
              Choices[J] := Choices[J + 1];
            Choices[NewCount + 1].Word := '                    '
          end
        else  (*--- Check Parser's Dictionary If Word Valid ---*)
          begin
            WordSearch (StartDict, Choices[Position].Word, Found);
            writeln ('Dictionary Check = ', Found:8);
            if Found
              then Choices[Position].Belief := Choices[Position].Belief * 1.1;

            (*--- Check Short Term Memory If Word Valid ---*)
            WordSearch (ShortTermList, Choices[Position].Word, Found);
            writeln ('Short term memory check = ', Found);
            if Found
              then Choices[Position].Belief := Choices[Position].Belief * 1.1;
            Position := Position + 1
          end
    end;

  if NewCount = 0
    then writeln ('Threshold accepted characters are in Error.');

  (*--- Normalize Belief Measures ---*)
  Normalize (NewCount, Choices);
  writeln ('The finalized choices are:');
  for I := 1 to NewCount do
    writeln (Choices[I].Word, ' believed at ', Choices[I].Belief);
end;  (* End of WORD EXPERT control module *)

procedure GetLetters (var NextLetters: ImageList; var EndWord: boolean);

(* Interface to input data from OCR. Assumes the depth of competing
   characters has been limited to five choices. *)

var I: integer;
    Ch: char;
    Number: real;
    Stop: boolean;

begin
  I := 1;
  Stop := false;
  while not eoln(Imagefile) and not EndWord and not eof(Imagefile) and
        not Stop do
    begin
      read (Imagefile, Ch, Number);
      writeln ('The image input character is ', Ch,
               ' with vector distance = ', Number);

      if (Number >= 9.8)
        then
          begin
            EndWord := true;
            if (Number >= 10.0)
              then Done := true
          end

procedure ConnectLetters (LetterChoice: LetterList; var Combos: integer;
                          var WordChoices: WordList);

(* Utility to perform letter combinametrics up to 3 uncertain characters
   per word. Called by the GetWords control module. *)

var I, J, K, L, M, X, Y, Counter: integer;

begin

  (*--- Initialization Loop ---*)
  for I := 1 to 27 do
    begin
      WordChoices[I].Word := '                    ';
      WordChoices[I].Belief := 0.0
    end;
  I := 1;
  Combos := 1;

  (*--- Count # of Combinations ---*)
  while (I < 21) and (LetterChoice[I][1].Character <> ' ') do
    begin
      if LetterChoice[I][1].Measure < 1.0
        then
          begin
            J := 1;
            while (LetterChoice[I][J].Character <> ' ') do
              J := J + 1;
            Combos := Combos * (J - 1)
          end;
      I := I + 1
    end;
  I := 1;

  (*--- Assemble Non-Variable Words ---*)
  if Combos = 1
    then
      begin
        WordChoices[2].Word[1] := ' ';
        while (I < 21) and (LetterChoice[I][1].Character <> ' ') do
          begin
            WordChoices[1].Word[I] := LetterChoice[I][1].Character;
            I := I + 1
          end;
        WordChoices[1].Belief := 1.0
      end

  (*--- Assemble Words from Letter Combinations ---*)
    else
      begin
        Counter := 0;
        while (I < 21) and (LetterChoice[I][1].Character <> ' ') do
          begin

            (*-- Copy Sure Letters to All Combos --*)
            if (LetterChoice[I][1].Measure = 1.0)
              then
                for J := 1 to Combos do  (* fill array with sure letters *)
                  begin
                    WordChoices[J].Word[I] := LetterChoice[I][1].Character;
                    WordChoices[J].Belief := WordChoices[J].Belief + 1
                  end

            (*-- Insert Unsure Letters Into Appropriate Positions --*)
              else
                begin

                  (*-- Collect Info for Letter Positioning --*)
                  Counter := Counter + 1;
                  J := 2;                    (* start at 2nd letter *)
                  while (LetterChoice[I][J].Character <> ' ') do
                    J := J + 1;  (* J - 1 = # choices for this unsure letter *)
                  K := Combos div (J - 1);
                                 (* K = # choices - other unsure letters *)
                  if Counter = 1
                    then X := J - 1;  (* remember for later arrangements *)
                  if Counter = 2
                    then Y := J - 1;  (* remember for later arrangements *)

                  if (Counter = 1)    (* fill array for 1st unsure letter *)
                    then
                      for M := 0 to (J - 2) do
                        for L := (M * K + 1) to ((M + 1) * K) do
                          begin
                            WordChoices[L].Word[I] :=
                              LetterChoice[I][M + 1].Character;
                            WordChoices[L].Belief := WordChoices[L].Belief +
                              LetterChoice[I][M + 1].Measure
                          end;

                  if (Counter = 2)    (* fill array for 2nd unsure letter *)
                    then
                      for M := 0 to (J - 2) do
                        for L := (M * K + 1) to ((M + 1) * K) do
                          begin
                            WordChoices[L].Word[I] :=
                              LetterChoice[I][L mod (J-1) + 1].Character;
                            WordChoices[L].Belief := WordChoices[L].Belief +
                              LetterChoice[I][L mod (J-1) + 1].Measure
                          end;

                  if Counter = 3      (* fill array for 3rd unsure letter *)
                    then
                      for M := 0 to (X * (J-1) - 1) do
                        for L := 1 to Y do
                          begin
                            WordChoices[M*Y+L].Word[I] :=
                              LetterChoice[I][M mod (J-1) + 1].Character;
                            WordChoices[M*Y+L].Belief :=
                              WordChoices[M*Y+L].Belief +
                              LetterChoice[I][M mod (J-1) + 1].Measure
                          end
                end;  (* end of else clause *)
            I := I + 1
          end
      end;

  WordChoices[Combos + 1].Word[1] := ' '

end;  (* End of CONNECT LETTERS module *)

procedure GetWords (var WordChoices: WordList; var EndSentence: boolean);

(* Control module to interface OCR and to set initial depth of search
   by using the OCR recognition thresholds. *)

var LetterChoices: LetterList;
    I, Combos: integer;
    NextLetter : ImageList;
    EndWord: boolean;

begin
  I := 1;
  EndWord := false;
  while not EndWord and (I < 21) do
    begin
      GetLetters (NextLetter, EndWord);
      LetterChoices[I] := NextLetter;
      I := I + 1
    end;

  if LetterChoices[I-2][1].Character in ['.', '?', '!']
    then EndSentence := true;

  ConnectLetters (LetterChoices, Combos, WordChoices);

  (*-- Process words with uncertain letters --*)
  if (Combos > 1)
    then WordExpert (Combos, WordChoices)
end;  (* End of GET WORDS control module *)

(******************************************************************************)

procedure ConnectWords (NewWords: WordGroup; var NewSentences: SentList);

(* Utility for performing sentence level combinametrics. Limited to
   three uncertain words per sentence. Called by GroupWords module. *)

var I, J, K, L, M, X, Y, Combos, Counter: integer;

begin
  Combos := 1;
  I := 1;
  while (I < 50) and (NewWords[I][1].Word[1] <> ' ') do
    begin
      if NewWords[I][1].Belief < 1.0
        then
          begin
            J := 1;
            while (NewWords[I][J].Word[1] <> ' ') do
              J := J + 1;
            Combos := Combos * (J - 1)
          end;
      I := I + 1
    end;

  I := 1;
  if Combos = 1
    then
      while (I < 50) and (NewWords[I][1].Word[1] <> ' ') do
        begin
          NewSentences[1].SentGuess[I] := NewWords[I][1].Word;
          I := I + 1
        end
    else
      begin
        Counter := 0;
        while (I < 50) and (NewWords[I][1].Word[1] <> ' ') do
          begin
            if (NewWords[I][1].Belief = 1.0)
              then
                for J := 1 to Combos do
                  begin
                    NewSentences[J].SentGuess[I] := NewWords[I][1].Word;
                    NewSentences[J].Belief := NewSentences[J].Belief + 1
                  end
              else
                begin
                  Counter := Counter + 1;
                  J := 2;
                  while (NewWords[I][J].Word[1] <> ' ') do
                    J := J + 1;
                  K := Combos div (J - 1);
                  if Counter = 1
                    then X := J - 1;
                  if Counter = 2
                    then Y := J - 1;

                  if Counter = 1
                    then
                      for M := 0 to (J - 2) do
                        for L := (M * K + 1) to ((M + 1) * K) do
                          begin
                            NewSentences[L].SentGuess[I] :=
                              NewWords[I][M + 1].Word;
                            NewSentences[L].Belief := NewSentences[L].Belief +
                              NewWords[I][M + 1].Belief
                          end;

                  if Counter = 2
                    then
                      for M := 0 to (J - 2) do
                        for L := (M * K + 1) to ((M + 1) * K) do
                          begin
                            NewSentences[L].SentGuess[I] :=
                              NewWords[I][L mod (J-1) + 1].Word;
                            NewSentences[L].Belief := NewSentences[L].Belief +
                              NewWords[I][L mod (J-1) + 1].Belief
                          end;

                  if Counter = 3
                    then
                      for M := 0 to (X * (J-1) - 1) do
                        for L := 1 to Y do
                          begin
                            NewSentences[M*Y+L].SentGuess[I] :=
                              NewWords[I][M mod (J-1) + 1].Word;
                            NewSentences[M*Y+L].Belief :=
                              NewSentences[M*Y+L].Belief +
                              NewWords[I][M mod (J-1) + 1].Belief
                          end
                end;
            I := I + 1
          end;

        for L := 1 to Combos do
          NewSentences[L].Belief := NewSentences[L].Belief /
            (Combos * (I - 1) - (Counter * (Combos - 1)))
      end;

  NewSentences[Combos + 1].SentGuess[1][1] := ' '
end;  (* End of CONNECT WORDS module *)

procedure GroupWords (var NewSentences: SentList);

(* Mid-level control module. Used to interface word level processing to
   sentence level processing. If an upper/lower case knowledge source is
   added to this system, it may need the I=1 information from this module
   to mark the start of words. *)

var NewWords: WordGroup;
    WordChoices: WordList;
    I: integer;
    EndSentence: boolean;

begin
  I := 1;
  EndSentence := false;
  while not EndSentence do
    begin
      GetWords (WordChoices, EndSentence);
      NewWords[I] := WordChoices;
      I := I + 1
    end;

  if I < 50
    then NewWords[I][1].Word[1] := ' ';

  ConnectWords (NewWords, NewSentences)  (* makes all sentences combos *)
end;  (* End of GROUP WORDS control module. *)

procedure ParseSentence (var NewSentences: SentList;
                         var FinalChoice: Sentence);

(* This module simulates the syntactic parser of this postprocessing system.
   It relies on the operator to perform the grammar validation. If a word
   is not in the parser's vocabulary, it may take on any part of speech
   necessary to form a grammatically correct sentence. This module also
   selects the final sentence solution; it assumes sentence hypotheses are
   in order of preference. *)

var I, J, K: integer;
    Answer: char;
    Selected: boolean;

begin
  Selected := false;
  writeln (' ':20, 'Sentence Parser Simulation');
  writeln;

  I := 1;
  while (NewSentences[I].SentGuess[1][1] <> ' ') and not Selected do
  begin
    J := 1;
    writeln ('Does this sentence parse? (y/n)');
    writeln ('Allow unknown words to take on any necessary part of speech.');
    repeat
      if (NewSentences[I].SentGuess[J] <> ' ')
        then write (NewSentences[I].SentGuess[J])
        else write (' ');
      J := J + 1
    until (NewSentences[I].SentGuess[J - 1][1] in [' ', '?', '!']);

    writeln;
    readln (Answer);
    if (Answer = 'y')
      then Selected := true
      else I := I + 1
  end;

  if Selected = true
    then FinalChoice := NewSentences[I]
    else FinalChoice := NewSentences[1]
end;  (* End of PARSE SENTENCE module *)

(******************************************************************************)
procedure StoreText (var FinalChoice: Sentence; var TempText: text);

(* Utility used to record the final sentence choice in an output file. *)

var I, J: integer;

begin
  I := 1;
  repeat
    for J := 1 to 20 do
      if (FinalChoice.SentGuess[I][J] <> ' ')
        then write (TempText, FinalChoice.SentGuess[I][J])
        else write (TempText, ' ');
    I := I + 1
  until (FinalChoice.SentGuess[I - 1][1] in [' ', '.', '!']);

  writeln (TempText)
end;  (* End of STORE TEXT module *)


procedure RankSentences (var NewSentences: SentList);

(* Utility used to normalize the sentence belief space and to place
   sentences in order of preference prior to parsing. *)

var I, J, K: integer;
    Sum: real;
    Temp: SentList;

begin
  I := 1;
  Sum := 0.0;

  (* On exit, I is one past the last sentence hypothesis. *)
  while (NewSentences[I].SentGuess[1][1] <> ' ') do
    I := I + 1;

  if I = 2
  then NewSentences[1].Belief := 1.0
  else
    if I = 1
    then writeln ('ERROR: No Sentence Parsed.')
    else
    begin
      for J := 1 to (I - 1) do
        Sum := Sum + NewSentences[J].Belief;
      for J := 1 to (I - 1) do
        NewSentences[J].Belief := NewSentences[J].Belief / Sum;

      (* Sort in place by descending belief; Temp[1] is swap space. *)
      for J := 1 to (I - 2) do
        for K := (J + 1) to (I - 1) do
          if (NewSentences[K].Belief > NewSentences[J].Belief)
          then
          begin
            Temp[1] := NewSentences[J];
            NewSentences[J] := NewSentences[K];
            NewSentences[K] := Temp[1]
          end
    end
end;  (* End of RANK SENTENCE module *)

procedure Remember (FinalChoice: Sentence);

(* Utility used to update the short term memory and the parsing vocabulary
   after a sentence hypothesis is chosen. Since the parsing dictionary is
   simulated, only a reminder to the operator is displayed for that portion
   of the update. The dynamic portion of the short term memory is limited to
   200 additional words. *)

var I, Count: integer;
    Current, Previous, Temp: DictPtr;
    Found: boolean;

begin
  writeln ('Please include all new words from chosen sentence into ',
           'parser dictionary.');

  I := 1;
  repeat
    Found := false;
    WordSearch (FinalChoice.SentGuess[I], Found);
    if not Found
    then
    begin
      new (Temp);
      Temp^.Word := FinalChoice.SentGuess[I];
      Temp^.Next := Begin200;
      Begin200 := Temp;

      (*-- remove duplicates --*)
      Found := false;
      Previous := Begin200;
      Current := Previous^.Next;
      while (Current <> nil) and (not Found) do
        if Current^.Word = FinalChoice.SentGuess[I]
        then
        begin
          Previous^.Next := Current^.Next;
          dispose (Current);
          Found := true   (* stop once the duplicate is unlinked *)
        end
        else
        begin
          Previous := Current;
          Current := Current^.Next
        end;

      (*-- Limit to 200 words --*)
      Count := 1;
      Current := Begin200;
      while (Count < 201) and (Current^.Next <> nil) do
      begin
        Current := Current^.Next;
        Count := Count + 1
      end;

      if Current^.Next <> nil
      then
      begin
        dispose (Current^.Next);
        Current^.Next := nil
      end
    end;

    I := I + 1
  until (FinalChoice.SentGuess[I - 1][1] in [' ', '.', '!']);

end;  (* End of the REMEMBER module *)

procedure MakeSentence (var FinalChoice: Sentence; var TempText: text);

(* Higher level control module. Used to interface sentence level
   rule-based knowledge sources such as the syntactic parser. Also used
   to update the parsing vocabulary and the short term memory after each
   sentence in the text completes processing. *)

var NewSentences: SentList;

begin
  GroupWords (NewSentences);
  RankSentences (NewSentences);   (* order hypotheses before parsing *)
  ParseSentence (NewSentences, FinalChoice);
  Remember (FinalChoice)
end;  (* End of MAKE SENTENCE control module *)


(******************************************************************************)

procedure Printout (var TempText: text);

(* Utility used to display copy of output text file to terminal. *)

var Ch: char;

begin
  writeln ('The completed script is:');
  reset (TempText);
  while not eof(TempText) do
  begin
    while not eoln(TempText) do
    begin
      read (TempText, Ch);
      write (Ch)
    end;
    readln (TempText);
    writeln
  end
end;  (* End of PRINTOUT module *)

begin  (* MAIN PROGRAM *)
  Initialize;
  while not Done do
  begin
    MakeSentence (NewSentence, TempText);
    StoreText (NewSentence, TempText)
  end;
  Printout (TempText)
end.  (* MAIN PROGRAM *)


200 Most Frequent Words

the         we          man         get         however
of          him         me          here        home
and         been        even        between     small
to          has         most        both        found
a           when        make        life        thought
in          who         after       being       went
that        will        also        under       say
is          more        did         never       part
was         no          many        day         once
he          if          before      same        general
for         out         must        another     high
it          so          through     know        upon
with        said        back        while       school
as          what        years       last        every
his         up          where       might       don't
on          its         much        us          does
be          about       your        great       got
at          into        way         old         united
by          than        well        year        left
I           them        down        off         number
this        can         should      come        course
had         only        because     since       war
not         other       each        against     until
are         new         just        go          always
but         some        those       came        away
from        could       people      right       something
or          time        how         used        fact
have        these       too         take        though
an          two         little      three       water
they        may         state       states      less
which       then        good        himself     public
one         do          very        few         put
you         first       make        house       think
were        any         world       use         almost
her         my          still       during      hand
all         now         own         again       enough
she         such        see         without     far
there       like        men         place       took
would       our         work        american    head
their       over        long        around      yet

Exceptions to Rules

aardvark         eelgras          klaxon           oologic          qivuit
aardwolf         eelpout          kleenex          oologically      qoph
aeolian          eelworm          kleptomania      oologist         queue
aeolic           een              kleptomaniak     oology           queuing
aeolipile        eer              klondike         oolong           schlemiel
aeon             eerie            kloof            oomiak           schlep
aeonian          eighth           klutz            oomph            schlieren
aorist           eocene           klystron         oophorectomie    schlimazel
aorta            eohippu          kohl             oophorectomy     schlock
aoudad           eolian           kohrabi          oophoriti        schmaltz
archaeology      eolith           kohrabie         oophyte          schmeer
archaeopteryx    eolithic         kvass            oosperm          schmo
archaeozoic      eon              kvetch           oospore          schmoose
ay               eonian           kvetche          oosporic         schmuck
ayah             eosin            kvetched         oosporou         schnapper
aye              eosinophil       kvetching        ootheca          schnapps
ayin             equiangular      kwacha           oothecae         schnecken
bdellium         eyas             llama            ootid            schnitzel
borscht          eyra             llano            ooze             schnook
buhl             eyrie            marshmallow      oozed            schnorrer
buhrston         eyrir            o'clock          oozier           schnozzle
chthonian        eyry             oedema           ooziest          schwa
chthonic         fahrenheit       oedipal          oozily           seeing
ctenidium        faience          oedipean         oozines          sixth
ctenoid          fellmonger       oedipu           oozing           sphragistic
ctenophore       fifth            oenology         oozy             thousandth
czar             fjeld            oenomel          oyer             tsar
czaravitch       fjord            oer              oyez             tsetse
czaravna         gooey            oersted          oyster           tsimme
czardas          grosz            oesophagus       phlebitis        tsunami
czarism          hundredth        oestrogen        phlebotomy       tsuri
Czech            iactric          oestu            phlegm           tsutsugamushi
Czechoslovakia   iactrical        oeuvre           phlegmatic       twelfth
Czechoslovakian  iamb             oocyte           phloem           tzar
delft            iambic           oodles           phlogistic       tzimme
delftware        iambu            oogamete         phlogopite       tzuri
dhak             Iasi             oogamie          phlogostron      uitlander
dharma           iatrogenic       oogamou          phlox            yacht
dhole            John             oogamy           qadi             yield
dhoti            khaki            oogenesi         qaf              zwieback
dhow             khan             oogenia          qaid             zwitterion
djinni           khat             oogenium         qasida
djinny           klan             oolite           qat
eau              klansman         oolith           qibla
                 klavern          oolitic          qintar

APPENDIX  B 


Fourier Distance Matrix for Simple Character Set

This appendix contains the matrix of distances between the
character feature sets for the characters shown in Appendix C. Each
feature set was measured using the three lowest harmonics of a 2D-FFT on
a 16 by 16 pixel image, using the program in Appendix D.
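
The feature measurement described above can be sketched in Python. This is only an illustration, not the thesis program: the function names (`coefficient`, `feature_vector`, `distance`) and the bar-shaped test images are assumptions made for the example, and the coefficients are computed by direct summation rather than by the prime factor algorithm of Appendix D. The selection of 25 real parts and 24 imaginary parts follows the index pattern of the Filter procedure in that listing; the Normalize step is omitted because its definition does not appear in the listing.

```python
import cmath
import math


def coefficient(img, u, v):
    """One 2D DFT coefficient F(u, v) of a square real image (direct sum)."""
    n = len(img)
    return sum(
        img[x][y] * cmath.exp(-2j * math.pi * (u * x + v * y) / n)
        for x in range(n)
        for y in range(n)
    )


def feature_vector(img, harmonics=3):
    """Real parts for frequencies (u, v) <= (0, 0), imaginary parts for the rest.

    With harmonics=3 this gives 25 real values (the last one is the DC term)
    followed by 24 imaginary values, 49 in total.
    """
    reals, imags = [], []
    for u in range(-harmonics, harmonics + 1):
        for v in range(-harmonics, harmonics + 1):
            c = coefficient(img, u, v)
            if (u, v) <= (0, 0):      # lexicographic: u < 0, or u = 0 and v <= 0
                reals.append(c.real)
            else:
                imags.append(c.imag)
    return reals + imags


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def bar(vertical):
    """Hypothetical 16 x 16 test image: one straight stroke through the middle."""
    img = [[0.0] * 16 for _ in range(16)]
    for k in range(16):
        if vertical:
            img[k][8] = 1.0
        else:
            img[8][k] = 1.0
    return img


v_bar = feature_vector(bar(True))
h_bar = feature_vector(bar(False))
print(len(v_bar), distance(v_bar, v_bar), distance(v_bar, h_bar) > 0)
```

Each entry of the distance matrix is one such `distance` value for a pair of characters.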

APPENDIX C

Test Set of Simple Characters

This appendix contains the character set that was used to pick the
groups of characters which would compete in a word hypothesis for the
same character position. The character set was made of simple block
style characters to represent a font which has little separation between
characters in the n-dimensional space of the feature set representation.
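
One way such competing groups could be read off a distance matrix is sketched below. The rule used here, taking the k nearest characters, is an assumption made for illustration; this appendix does not state the selection rule that was actually applied. The function name `competing_groups` and the three-character matrix are hypothetical.

```python
def competing_groups(labels, dist, k=3):
    """For each character, list its k nearest neighbours in the distance matrix.

    Assumption: "competing" characters are simply the closest ones in
    feature space; the source does not specify the actual rule.
    """
    groups = {}
    for i, label in enumerate(labels):
        others = [j for j in range(len(labels)) if j != i]
        others.sort(key=lambda j: dist[i][j])
        groups[label] = [labels[j] for j in others[:k]]
    return groups


# Hypothetical 3-character distance matrix (symmetric, zero diagonal).
labels = ["O", "Q", "I"]
dist = [
    [0.0, 0.4, 2.5],
    [0.4, 0.0, 2.7],
    [2.5, 2.7, 0.0],
]
groups = competing_groups(labels, dist, k=1)
print(groups)
```

A font with little separation between characters produces many small off-diagonal distances, so the groups are less clear-cut and the postprocessor has more ambiguity to resolve.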


APPENDIX  D 


2D-FFT Program Listing

This appendix lists the Pascal program used to perform a two-dimensional
discrete Fourier transform on the character set found in
Appendix C. The program was also used to produce the distance matrix
found in Appendix B. The program was written in Turbo Pascal to run on
an IBM compatible microcomputer.
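
The listing computes the two-dimensional transform by making two passes of a one-dimensional DFT with a transposition of rows and columns between passes. A minimal Python sketch of that row-column idea follows. It is an assumption-laden illustration: `dft1`, `transpose`, and `dft2_row_column` are hypothetical names, the one-dimensional transform is a direct sum rather than the 16-point prime factor module, and a second transposition is added at the end so the result is indexed as F[u][v].

```python
import cmath
import math


def dft1(seq):
    """Direct one-dimensional DFT of a sequence (O(n^2))."""
    n = len(seq)
    return [
        sum(seq[x] * cmath.exp(-2j * math.pi * k * x / n) for x in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    """Swap rows and columns of a square list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def dft2_row_column(img):
    """Two-dimensional DFT as two passes of dft1 with a transposition between."""
    pass1 = [dft1(row) for row in img]                # transform along each row
    pass2 = [dft1(row) for row in transpose(pass1)]   # then along each column
    return transpose(pass2)                           # restore F[u][v] orientation


# A unit impulse at the origin transforms to a constant spectrum.
impulse = [[1.0 if (x, y) == (0, 0) else 0.0 for y in range(4)] for x in range(4)]
spectrum = dft2_row_column(impulse)
print(all(abs(c - 1) < 1e-9 for row in spectrum for c in row))
```

The decomposition works because the two-dimensional DFT kernel separates into a product of a row kernel and a column kernel, so a single 16-point routine can serve both passes.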


(*

TWO-DIMENSIONAL DISCRETE FOURIER TRANSFORM

This program performs two-dimensional discrete Fourier transforms,
2D-DFTs, on digitized arrays of 16 by 16 points. The input data is
located in a file named Pixels. This file is a list of real numbers,
with one entry per line. The output file, Prototypes, is a matrix
of distances between each of the 16 x 16 arrays processed. The 62
arrays processed represent a digitized set of upper case letters,
lower case letters, and the numerals, 0 through 9.

*)

program LetterDFT (Pixels, Prototypes);

const NumChar = 62;

type OneDArray = array [1..16] of real;
     TwoDArray = array [1..16, 1..16] of real;
     ProType   = array [1..49] of real;
     AllTypes  = array [1..NumChar] of ProType;
     DistArray = array [1..NumChar, 1..NumChar] of real;

var Reals, Imaginary: TwoDArray;
    Filtered: ProType;
    CharDFT: AllTypes;
    Matrix: DistArray;
    I, J, K, Line: integer;
    Infile, Outfile: text;


(* The procedures DFT and PFA perform a 2D-DFT by making two passes of a
   single dimension DFT with a transposition of rows and columns between
   passes. These procedures were adapted to run on Turbo Pascal from a
   prime factor program developed in Fortran by C. S. Burrus at Rice
   University (24: 127-135). *)


procedure DFT (var X, Y: TwoDArray);

var X1, Y1: OneDArray;
    J, K: integer;

(***************************************************************)
procedure PFA (var A, B: OneDArray);

const
  C161 = 0.70710678118654;   (* cos(pi/4) *)
  C162 = 0.38268343236510;   (* sin(pi/8) *)
  C163 = 1.30656296487638;   (* cos(pi/8) + sin(pi/8) *)
  C164 = 0.54119610014619;   (* cos(pi/8) - sin(pi/8) *)
  C165 = 0.92387953251128;   (* cos(pi/8) *)

var R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, R12, R13, R14,
    R15, R16, T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11,
    S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14,
    S15, S16, U1, U2, U3, U4, U5, U6, U7, U8, U9, U10, U11: real;

begin

(* ---- REAL EQUATIONS ---- *)

R1  := A[1] + A[9];
R2  := A[1] - A[9];
R3  := A[2] + A[10];
R4  := A[2] - A[10];
R5  := A[3] + A[11];
R6  := A[3] - A[11];
R7  := A[4] + A[12];
R8  := A[4] - A[12];
R9  := A[5] + A[13];
R10 := A[5] - A[13];
R11 := A[6] + A[14];
R12 := A[6] - A[14];
R13 := A[7] + A[15];
R14 := A[7] - A[15];
R15 := A[8] + A[16];
R16 := A[8] - A[16];
T1  := R1 + R9;
T2  := R1 - R9;
T3  := R3 + R11;
T4  := R3 - R11;
T5  := R5 + R13;
T6  := R5 - R13;
T7  := R7 + R15;
T8  := R7 - R15;
T9  := (T4 + T8) * C161;
T10 := (T4 - T8) * C161;
R1  := T1 + T5;
R3  := T1 - T5;
R5  := T3 + T7;
R7  := T3 - T7;
R9  := T2 + T10;
R11 := T2 - T10;
R13 := T6 + T9;
R15 := T6 - T9;
T1  := R4 + R16;
T2  := R4 - R16;
T3  := (R6 + R14) * C161;
T4  := (R6 - R14) * C161;
T5  := R8 + R12;
T6  := R8 - R12;
T7  := C162 * (T2 - T6);
T8  := C163 * T2 - T7;
T9  := C164 * T6 - T7;
T10 := R2 + T4;
T11 := R2 - T4;
R2  := T10 + T8;
R4  := T10 - T8;
R6  := T11 + T9;
R8  := T11 - T9;
T7  := C165 * (T1 + T5);
T8  := T7 - T1 * C164;
T9  := T7 - T5 * C163;
T10 := R10 + T3;
T11 := R10 - T3;
R10 := T10 + T8;
R12 := T10 - T8;
R14 := T11 + T9;
R16 := T11 - T9;


(* ---- IMAGINARY EQUATIONS ---- *)

S1  := B[1] + B[9];
S2  := B[1] - B[9];
S3  := B[2] + B[10];
S4  := B[2] - B[10];
S5  := B[3] + B[11];
S6  := B[3] - B[11];
S7  := B[4] + B[12];
S8  := B[4] - B[12];
S9  := B[5] + B[13];
S10 := B[5] - B[13];
S11 := B[6] + B[14];
S12 := B[6] - B[14];
S13 := B[7] + B[15];
S14 := B[7] - B[15];
S15 := B[8] + B[16];
S16 := B[8] - B[16];
U1  := S1 + S9;
U2  := S1 - S9;
U3  := S3 + S11;
U4  := S3 - S11;
U5  := S5 + S13;
U6  := S5 - S13;
U7  := S7 + S15;
U8  := S7 - S15;
U9  := (U4 + U8) * C161;
U10 := (U4 - U8) * C161;
S1  := U1 + U5;
S3  := U1 - U5;
S5  := U3 + U7;
S7  := U3 - U7;
S9  := U2 + U10;
S11 := U2 - U10;
S13 := U6 + U9;
S15 := U6 - U9;
U1  := S4 + S16;
U2  := S4 - S16;
U3  := (S6 + S14) * C161;
U4  := (S6 - S14) * C161;
U5  := S8 + S12;
U6  := S8 - S12;
U7  := C162 * (U2 - U6);
U8  := C163 * U2 - U7;
U9  := C164 * U6 - U7;
U10 := S2 + U4;
U11 := S2 - U4;
S2  := U10 + U8;
S4  := U10 - U8;
S6  := U11 + U9;
S8  := U11 - U9;
U7  := C165 * (U1 + U5);
U8  := U7 - U1 * C164;
U9  := U7 - U5 * C163;
U10 := S10 + U3;
U11 := S10 - U3;
S10 := U10 + U8;
S12 := U10 - U8;
S14 := U11 + U9;
S16 := U11 - U9;

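(* ---- OUTPUT EQUATIONS ---- *)

(* The left-hand sides of the output assignments are illegible in this
   copy. The surviving right-hand sides, in printed order, are:

   ... := R1 + R5;
   ... := R1 - R5;
   ... := R2 + S10;
   ... := R2 - S10;
   ... := R9 + S13;
   ... := R8 + S16;
   ... := R3 + S7;
   ... := R3 - S7;
   ... := R6 + S14;
   ... := R6 - S14;
   ... := R11 - S15;
   ... := R11 + S15;
   ... := R4 - S12;
   ... := R4 + S12;
   ... := S1 - S5;
   ... := S1 + S5;
   ... := S2 - R10;
   ... := S2 + R10;
*)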

(* The procedure, Filter, removes the real and imaginary components of the
   lowest three harmonics, and the DC component, from the 16 by 16 output
   transform. *)

procedure Filter (Reals, Imag: TwoDArray; var Vector: ProType);
begin
  Vector[1]  := Reals[14][14];
  Vector[2]  := Reals[14][15];
  Vector[3]  := Reals[14][16];
  Vector[4]  := Reals[14][1];
  Vector[5]  := Reals[14][2];
  Vector[6]  := Reals[14][3];
  Vector[7]  := Reals[14][4];
  Vector[8]  := Reals[15][14];
  Vector[9]  := Reals[15][15];
  Vector[10] := Reals[15][16];
  Vector[11] := Reals[15][1];
  Vector[12] := Reals[15][2];
  Vector[13] := Reals[15][3];
  Vector[14] := Reals[15][4];
  Vector[15] := Reals[16][14];
  Vector[16] := Reals[16][15];
  Vector[17] := Reals[16][16];
  Vector[18] := Reals[16][1];
  Vector[19] := Reals[16][2];
  Vector[20] := Reals[16][3];
  Vector[21] := Reals[16][4];
  Vector[22] := Reals[1][14];
  Vector[23] := Reals[1][15];
  Vector[24] := Reals[1][16];
  Vector[25] := Reals[1][1];   (* DC term *)
  Vector[26] := Imag[1][2];
  Vector[27] := Imag[1][3];
  Vector[28] := Imag[1][4];
  Vector[29] := Imag[2][14];
  Vector[30] := Imag[2][15];
  Vector[31] := Imag[2][16];
  Vector[32] := Imag[2][1];
  Vector[33] := Imag[2][2];
  Vector[34] := Imag[2][3];
  Vector[35] := Imag[2][4];
  Vector[36] := Imag[3][14];
  Vector[37] := Imag[3][15];
  Vector[38] := Imag[3][16];
  Vector[39] := Imag[3][1];
  Vector[40] := Imag[3][2];
  Vector[41] := Imag[3][3];
  Vector[42] := Imag[3][4];
  Vector[43] := Imag[4][14];
  Vector[44] := Imag[4][15];
  Vector[45] := Imag[4][16];
  Vector[46] := Imag[4][1];
  Vector[47] := Imag[4][2];
  Vector[48] := Imag[4][3];
  Vector[49] := Imag[4][4]
end;

(* The procedure, Neighbor, computes the distances between all prototype
   vectors that were processed. These include 26 lower case letters, 26
   upper case letters, and 10 numerals. *)

procedure Neighbor (CharDFT: AllTypes; var Matrix: DistArray);

var I, J, K: integer;
    Difference, Distance: real;

begin
  for K := 1 to NumChar do
    for I := 1 to NumChar do
    begin
      Distance := 0.0;
      for J := 1 to 49 do
        Distance := Distance + sqr(CharDFT[I][J] - CharDFT[K][J]);
      Matrix[K][I] := sqrt (Distance);
      writeln (K:3, I:5)
    end
end;

(****************************************************************************)

begin  (* --- MAIN PROGRAM --- *)

assign (Infile, 'Pixels');
assign (Outfile, 'Prototypes');
reset (Infile);
rewrite (Outfile);

for I := 1 to NumChar do
begin
  writeln (I);
  DFT (Reals, Imaginary);
  Filter (Reals, Imaginary, Filtered);
  Normalize (Filtered);
  CharDFT[I] := Filtered
end;

Neighbor (CharDFT, Matrix);
writeln (Outfile, 'Distance Matrix');
for K := 0 to 2 do
begin
  writeln (Outfile);
  for I := (K * 16 + 1) to ((K+1) * 16) do
  begin
    if (I < 27)
      then write (Outfile, char(I + 64):6);
    if (I > 26) and (I < 53)
      then write (Outfile, char(I + 70):6);
    if (I > 52)
      then write (Outfile, char(I - 5):6)
  end;

  writeln (Outfile);
  for I := 1 to NumChar do
  begin
    if (I < 27)
      then write (Outfile, char(I + 64));
    if (I > 26) and (I < 53)
      then write (Outfile, char(I + 70));
    if (I > 52)
      then write (Outfile, char(I - 5));
    for J := (K * 16 + 1) to ((K+1) * 16) do
      write (Outfile, Matrix[I][J]:6:2);
    writeln (Outfile)
  end
end;

writeln (Outfile);
for I := 49 to NumChar do
begin
  if (I > 26) and (I < 53)
    then write (Outfile, char(I + 70):6);
  if (I > 52)
    then write (Outfile, char(I - 5):6)
end;

writeln (Outfile);
for I := 1 to NumChar do
begin
  if (I < 27)
    then write (Outfile, char(I + 64));
  if (I > 26) and (I < 53)
    then write (Outfile, char(I + 70));
  if (I > 52)
    then write (Outfile, char(I - 5));
  for J := 49 to NumChar do
    write (Outfile, Matrix[I][J]:6:2);
  writeln (Outfile)
end;

writeln (Outfile);

for I := 1 to NumChar do
begin
  if (I < 27)
    then writeln (Outfile, 'Vectors for the letter:', char(I + 64));
  if (I > 26) and (I < 53)
    then writeln (Outfile, 'Vectors for the letter:', char(I + 70));
  if (I > 52)
    then writeln (Outfile, 'Vectors for the number:', char(I - 5));
  for J := 1 to 7 do
  begin
    Line := (J - 1) * 7;
    for K := 1 to 7 do
      write (Outfile, CharDFT[I][Line + K]:7:2);
    writeln (Outfile)
  end;
  writeln (Outfile)
end;

close (Outfile)
end.